Randomly selecting records from a large data set may be helpful if your data set is so large that processing it is slow, or if you are conducting a survey and need to select a random sample from some master database. When you select records randomly from a larger data set (or some master database), you can achieve the sampling in a few different ways, including:
- sampling without replacement, in which a subset of the observations is selected randomly, and once an observation is selected it cannot be selected again.
- sampling with replacement, in which a subset of the observations is selected randomly, and an observation may be selected more than once.
- selecting a stratified sample, in which a subset of the observations is selected randomly from each group of observations defined by the value of a stratifying variable, and once an observation is selected it cannot be selected again.
In this section, we'll investigate sampling without replacement. Then, in the next two sections, we'll investigate sampling with replacement and selecting a stratified sample. Throughout the three sections, we'll work with a contrived mailing list. We'll use the list under the guise of being a large catalog mail-order company wanting to conduct a random survey of a subset of our customers. The actual list we'll use is admittedly (much) smaller than what we would be working with in practice. Our teeny-tiny mailing list is, of course, used merely for the purpose of illustrating some random sampling techniques in SAS.
Example 34.1
The mailing list with which we will be working is contained in a permanent SAS data set called mailing. The following SAS code simply prints the mailing list:
OPTIONS PS = 58 LS = 80 NODATE NONUMBER;
LIBNAME stat482 'C:\*InsertYourDriveName*\Stat482wc\sasndata';
PROC PRINT data=stat482.mailing;
title 'Sample Dataset: Mailing List';
RUN;
Obs | Num | Name | Street | City | State |
---|---|---|---|---|---|
1 | 1 | Jonathon Smothers | 103 Oak Lane | Bellefonte | PA |
2 | 2 | Jane Doe | 845 Main Street | Bellefonte | PA |
3 | 3 | Jim Jefferson | 10101 Allegheny Street | Bellefonte | PA |
4 | 4 | Mark Adams | 312 Oak Lane | Bellefonte | PA |
5 | 5 | Lisa Brothers | 89 Elm Street | Bellefonte | PA |
6 | 6 | Delilah Fequa | 2094 Acorn Street | Bellefonte | PA |
7 | 7 | John Doe | 812 Main Street | Bellefonte | PA |
8 | 8 | Mamie Davison | 102 Cherry Avenue | Bellefonte | PA |
9 | 9 | Ernest Smith | 492 Main Street | Bellefonte | PA |
10 | 10 | Laura Mills | 704 Hill Street | Bellefonte | PA |
11 | 11 | Linda Bentlager | 1010 Tricia Lane | Bellefonte | PA |
12 | 12 | Fran Cipolla | 912 Cardinal Drive | Bellefonte | PA |
13 | 13 | James Whitney | 104 Pine Hill Drive | Bellefonte | PA |
14 | 14 | William Edwards | 79 Oak Lane | Bellefonte | PA |
15 | 15 | Harold Harvey | 480 Main Street | Bellefonte | PA |
16 | 38 | Miriam Denders | 2348 Robin Avenue | Port Matilda | PA |
17 | 39 | Scott Fitzgerald | 43 Blue Jay Drive | Port Matilda | PA |
18 | 40 | Jane Smiley | 298 Cardinal Drive | Port Matilda | PA |
19 | 41 | Lou Barr | 219 Eagle Street | Port Matilda | PA |
20 | 42 | Casey Spears | 123 Main Street | Port Matilda | PA |
21 | 43 | Leslie Olin | 487 Bluebird Haven | Port Matilda | PA |
22 | 44 | Edwin Hoch | 389 Dolphin Drive | Port Matilda | PA |
23 | 45 | Ann Draper | 72 Lake Road | Port Matilda | PA |
24 | 46 | Linda Nicolson | 71 Liberty Terrace | Port Matilda | PA |
25 | 47 | Barb Wyse | 21 Cleveland Drive | Port Matilda | PA |
26 | 48 | Coach Pierce | 74 Main Street | Port Matilda | PA |
27 | 49 | Tim Winters | 95 Dove Street | Port Matilda | PA |
28 | 50 | George Matre | 75 Ashwind Drive | Port Matilda | PA |
29 | 16 | Linda Edmonds | 410 College Avenue | State College | PA |
30 | 17 | Rigna Patel | 101 Beaver Avenue | State College | PA |
31 | 18 | Ade Fequa | 803 Allen Street | State College | PA |
32 | 19 | Frank Smith | 238 Waupelani Drive | State College | PA |
33 | 20 | Kristin Jones | 120 Stratford Drive | State College | PA |
34 | 21 | Amy Kuntz | 357 Park Avenue | State College | PA |
35 | 22 | Roberta Kudla | 312 Whitehall Road | State College | PA |
36 | 23 | Greg Pope | 5100 No. Atherton | State College | PA |
37 | 24 | Mark Mendel | 256 Fraser Street | State College | PA |
38 | 25 | Steve Lindhoff | 130 E. College Avenue | State College | PA |
39 | 26 | Jan Davison | 201 E. Beaver Avenue | State College | PA |
40 | 27 | Lucy Arnets | 345 E. College Avenue | State College | PA |
41 | 28 | Srabashi Kundu | 112 E. Beaver Avenue | State College | PA |
42 | 29 | Joe White | 678 S. Allen Street | State College | PA |
43 | 30 | Daniel Peterson | 328 Waupelani Drive | State College | PA |
44 | 31 | Robert Williams | 156 Straford Drive | State College | PA |
45 | 32 | George Ball | 888 Park Avenue | State College | PA |
46 | 33 | Steve Ignella | 367 Whitehall Road | State College | PA |
47 | 34 | Mike Dahlberg | 1201 No. Atherton | State College | PA |
48 | 35 | Doris Alcorn | 453 Fraser Street | State College | PA |
49 | 36 | Daniel Fremgen | 103 W. College Avenue | State College | PA |
50 | 37 | Scott Henderson | 245 W. Beaver Avenue | State College | PA |
First, download the mailing data set and save it to a convenient location on your computer. Then, after you launch the SAS program, edit the LIBNAME statement so that it reflects the location in which you saved the data set. Run the program and review the resulting output to familiarize yourself with the data set.
Approximate-Sized Samples
When using a computer program, such as SAS, to randomly select a subset of observations from some larger data set, there are two approaches we can take. We could tell SAS to randomly select a percentage, say 30%, of the observations in the data set. Or, we could tell SAS to randomly select an exact number, say 25, of the observations in the data set. With the former approach, we cannot be guaranteed that the subset data set will achieve a specific size. We call such a sample an "approximate-sized sample." In general, to obtain an approximate-sized sample, one selects each observation from the original data set with probability k%, so that about k% of the observations end up in the sample.
Example 34.2
The following program illustrates how to use a SAS DATA step to obtain an approximate-sized random sample without replacement. Specifically, the program uses the ranuni function and a WHERE= data set option to tell SAS to randomly sample approximately 30% of the 50 observations from the permanent SAS data set mailing:
DATA sample1A (where = (random le 0.30));
set stat482.mailing;
random = ranuni(43420);
RUN;
PROC PRINT data=sample1A NOOBS;
title1 'Sample1A: Approximate-Sized Simple Random Sample';
title2 'without Replacement';
RUN;
Num | Name | Street | City | State | random |
---|---|---|---|---|---|
1 | Jonathon Smothers | 103 Oak Lane | Bellefonte | PA | 0.07478 |
2 | Jane Doe | 845 Main Street | Bellefonte | PA | 0.25203 |
4 | Mark Adams | 312 Oak Lane | Bellefonte | PA | 0.08918 |
6 | Delilah Fequa | 2094 Acorn Street | Bellefonte | PA | 0.02253 |
7 | John Doe | 812 Main Street | Bellefonte | PA | 0.15570 |
8 | Mamie Davison | 102 Cherry Avenue | Bellefonte | PA | 0.05460 |
9 | Ernest Smith | 492 Main Street | Bellefonte | PA | 0.05662 |
14 | William Edwards | 79 Oak Lane | Bellefonte | PA | 0.15432 |
38 | Miriam Denders | 2348 Robin Avenue | Port Matilda | PA | 0.16192 |
41 | Lou Barr | 219 Eagle Street | Port Matilda | PA | 0.13033 |
43 | Leslie Olin | 487 Bluebird Haven | Port Matilda | PA | 0.23101 |
44 | Edwin Hoch | 389 Dolphin Drive | Port Matilda | PA | 0.20708 |
49 | Tim Winters | 95 Dove Street | Port Matilda | PA | 0.03722 |
20 | Kristin Jones | 120 Stratford Drive | State College | PA | 0.29425 |
22 | Roberta Kudla | 312 Whitehall Road | State College | PA | 0.05187 |
24 | Mark Mendel | 256 Fraser Street | State College | PA | 0.06246 |
26 | Jan Davison | 201 E. Beaver Avenue | State College | PA | 0.00799 |
31 | Robert Williams | 156 Straford Drive | State College | PA | 0.14537 |
34 | Mike Dahlberg | 1201 No. Atherton | State College | PA | 0.27246 |
35 | Doris Alcorn | 453 Fraser Street | State College | PA | 0.24231 |
Launch and run the SAS program. Then, review the resulting output to see the random sample that SAS selected from the mailing data set. You should note a few things. First, the people in the random sample appear to be fairly uniformly distributed across the 50 possible Num values. Second, the final random sample contains 20 of the 50 observations in the mailing data set. At 40% (20 out of 50), this is a little higher than the 30% sample we were asking for, but it should not be surprising, as it is an artifact of the method used. Finally, note that the variable random contains only values that are less than or equal to 0.30, as should be expected in light of the WHERE= option attached to the DATA statement.
Okay, now how does the program work? Before we answer the question, note that the technique we use is a technique commonly used by statisticians. It will work in any program, not just SAS. Now, for the answer ... the random assignment statement tells SAS to use the ranuni function to generate a (pseudo) random number between 0 and 1 and to assign the resulting number to a variable called random. The number 43420 that appears in the parentheses of the ranuni function is specified by the user and is called the seed. In general:
- The seed must be a non-negative integer less than 2,147,483,647 (that is, 2^31 - 1).
- A given seed always produces the same results. That is, using the same seed, the ranuni function would select the same observations.
- If you choose 0 as the seed, then the computer clock time at execution is used. In this case, it is very unlikely that the ranuni function would produce the same results twice. Note that it is common practice when conducting research to use a non-zero seed so that the results can be reproduced if necessary.
- The ranuni function can be used without assigning it to another variable. We assigned the value to the variable called random just so we could print the results.
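The reproducibility point in the list above can be checked directly. The following is a minimal sketch, not part of the original lesson: the data set names run1 and run2 are made up for illustration, and the seed 43420 is simply reused from the example. Each DATA step draws five numbers with the same seed, and PROC COMPARE should then report no differences between the two data sets.

```sas
/* First run: five pseudo-random numbers from seed 43420 */
DATA run1;
  do i = 1 to 5;
    random = ranuni(43420);
    output;
  end;
RUN;

/* Second run: a separate DATA step using the same seed */
DATA run2;
  do i = 1 to 5;
    random = ranuni(43420);
    output;
  end;
RUN;

/* Compare the two runs value by value */
PROC COMPARE base = run1 compare = run2;
RUN;
```

Changing the seed in only one of the two DATA steps should make PROC COMPARE report unequal values for random.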
Now, because the numbers generated by the ranuni function are uniformly distributed across the numbers between 0 and 1, we should expect about 30% of the random numbers to be less than 0.30. That's where the WHERE= option on the DATA statement comes into play. If the random number generated is less than or equal to 0.30, then the observation is selected for inclusion in the sample. Since the mailing data set has 50 observations, about 30% of the observations should be selected to create a sample of approximately 15 people. Because the selection depends on the values of the numbers generated, the sample cannot be guaranteed to be of a certain size.
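To get a feel for how much the sample size can vary, one can treat each of the 50 observations as being kept independently with probability 0.30. That is an idealization of what the pseudo-random numbers do, not a statement about any particular seed. Under that assumption, the sample size X follows a binomial distribution:

```latex
X \sim \mathrm{Binomial}(50,\ 0.30), \qquad
E[X] = 50 \times 0.30 = 15, \qquad
\mathrm{SD}(X) = \sqrt{50 \times 0.30 \times 0.70} = \sqrt{10.5} \approx 3.24
```

Seen this way, the 20 observations in our first sample sit about (20 - 15)/3.24, or roughly 1.5, standard deviations above the expected size, which is not unusual.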
You might want to change the seed a few times to see how it affects the sample. If you use seed 1, for example, you'll see that the new random sample contains 15 observations, not 20 as in our first sample. You might also want to change the proportion 0.30 to various other numbers between 0 and 1 to see how it affects the size of the sample.
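Rather than rerunning the program once per seed, the experiment can be sketched in a single DATA step. The code below is an illustration only: the data set name sizes, the variable names, and the particular seeds are arbitrary choices, and it uses the CALL RANUNI routine rather than the ranuni function, because the function takes its seed only from its first call in a DATA step. For each seed, it counts how many of 50 draws fall at or below 0.30.

```sas
/* For each seed, count how many of 50 uniform draws are <= 0.30.
   The count is the size an approximate 30% sample of 50 records would have. */
DATA sizes (keep = seed size);
  do seed = 1, 43420, 860244, 12345678;
    size = 0;
    s = seed;                 /* CALL RANUNI updates s on every call */
    do i = 1 to 50;
      call ranuni(s, x);      /* x receives the next uniform number */
      if x le 0.30 then size = size + 1;
    end;
    output;                   /* one row per seed */
  end;
RUN;

PROC PRINT data = sizes NOOBS;
  title 'Approximate sample sizes for several seeds';
RUN;
```

The sizes printed should scatter around 15, which is the point of calling such samples approximate-sized.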
Example 34.3
The following code illustrates an alternative way of randomly selecting an approximate-sized random sample without replacement. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample approximately 30% of the 50 observations from the permanent SAS data set mailing:
PROC SURVEYSELECT data = stat482.mailing
out = sample1B
method = SRS
seed = 12345678
samprate = 0.30;
title;
RUN;
PROC PRINT data = sample1B NOOBS;
title1 'Sample1B: Approximate-Sized Simple Random Sample';
title2 'without Replacement (using PROC SURVEYSELECT)';
RUN;
Num | Name | Street | City | State |
---|---|---|---|---|
1 | Jonathon Smothers | 103 Oak Lane | Bellefonte | PA |
5 | Lisa Brothers | 89 Elm Street | Bellefonte | PA |
12 | Fran Cipolla | 912 Cardinal Drive | Bellefonte | PA |
14 | William Edwards | 79 Oak Lane | Bellefonte | PA |
38 | Miriam Denders | 2348 Robin Avenue | Port Matilda | PA |
39 | Scott Fitzgerald | 43 Blue Jay Drive | Port Matilda | PA |
40 | Jane Smiley | 298 Cardinal Drive | Port Matilda | PA |
44 | Edwin Hoch | 389 Dolphin Drive | Port Matilda | PA |
45 | Ann Draper | 72 Lake Road | Port Matilda | PA |
50 | George Matre | 75 Ashwind Drive | Port Matilda | PA |
19 | Frank Smith | 238 Waupelani Drive | State College | PA |
24 | Mark Mendel | 256 Fraser Street | State College | PA |
29 | Joe White | 678 S. Allen Street | State College | PA |
34 | Mike Dahlberg | 1201 No. Atherton | State College | PA |
35 | Doris Alcorn | 453 Fraser Street | State College | PA |
Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample from the mailing data set. As you can see, the SURVEYSELECT procedure produces one page of output that is merely informational, reiterating much of the information that we supplied to SAS in our SURVEYSELECT code. Here is what each option in that code does:
- The DATA= option tells SAS the name of the input data set (stat482.mailing) from which observations should be selected.
- The OUT= option tells SAS the name of the output data set (sample1B) in which the selected observations should be stored.
- The METHOD= option tells SAS the sampling method that should be used. Here, SRS tells SAS to use the simple random sampling method to select observations, that is, with equal probability and without replacement.
- The SEED= option tells SAS the initial seed (12345678) for generating the random numbers. In general, the value of the SEED= option must be an integer, and if you do not specify the SEED= option, or if the SEED= value is negative or zero, the computer's clock is used to obtain the initial seed.
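- The SAMPRATE= option tells SAS what proportion (0.30) of the input data set should be sampled. Note that, for METHOD=SRS, the procedure by default converts the rate into a sample size by multiplying it by the number of input observations and rounding up, which is why the sample here contains exactly 15 observations (0.30 × 50) rather than merely about 15.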
Oh, and the empty title statement that appears in the code is there merely to minimize any confusion its absence may cause you. If it, or another TITLE statement, is not present, the first (informational) page of the SURVEYSELECT output will contain the most recent title, which in this case would concern Sample1A from the previous example. Now that would be confusing!
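If the informational page is not needed at all, another possibility is the NOPRINT option of PROC SURVEYSELECT, which suppresses the procedure's displayed output. The sketch below is not from the original lesson. To stay self-contained it builds a hypothetical 50-record stand-in called toy instead of reading stat482.mailing, and the output data set name toy_sample is likewise made up.

```sas
/* Hypothetical stand-in for the mailing list: 50 records numbered 1 to 50 */
DATA toy;
  do Num = 1 to 50;
    output;
  end;
RUN;

/* Same request as in the example, but with the informational page suppressed */
PROC SURVEYSELECT data = toy
  out = toy_sample
  method = SRS
  seed = 12345678
  samprate = 0.30
  noprint;
RUN;

PROC PRINT data = toy_sample NOOBS;
  title 'Sample drawn with NOPRINT in effect';
RUN;
```

With NOPRINT there is no informational page for a stale title to land on, so the empty TITLE statement is no longer needed for that purpose.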
Exact-Sized Samples
Thus far, we've produced only approximate-sized random samples without replacement. Now, we'll turn our attention to three examples that illustrate how to produce exact-sized random samples without replacement. We'll start (naturally?!) with the most complicated procedure first (using a DATA step) and end up with the most straightforward procedure last (using the SURVEYSELECT procedure).
Example 34.4
The following program illustrates how to use a SAS data step to obtain an exact-sized random sample without replacement. Specifically, the program uses the ranuni function in a DATA step to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:
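DATA sample2;
   set stat482.mailing nobs=total;   /* total = number of observations in mailing */
   if _N_ = 1 then n=total;          /* n = observations left to read */
   retain k 15 n;                    /* k = observations still needed; k and n persist across iterations */
   random = ranuni(860244);          /* uniform random number between 0 and 1 */
   propn = k/n;                      /* proportion of remaining observations still needed */
   if random le propn then
      do;
         output;                     /* select this observation */
         k=k-1;                      /* one fewer observation needed */
      end;
   n=n-1;                            /* one fewer observation left to read */
   if k=0 then stop;                 /* sample is complete */
RUN;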
PROC PRINT data=sample2 NOOBS;
title1 'Sample2: Exact-Sized Simple Random Sample';
title2 'without Replacement';
var num name n k random propn;
RUN;
Num | Name | n | k | random | propn |
---|---|---|---|---|---|
4 | Mark Adams | 47 | 15 | 0.12829 | 0.31915 |
5 | Lisa Brothers | 46 | 14 | 0.08799 | 0.30435 |
6 | Delilah Fequa | 45 | 13 | 0.02446 | 0.28889 |
9 | Ernest Smith | 42 | 12 | 0.01228 | 0.28571 |
14 | William Edwards | 37 | 11 | 0.12908 | 0.29730 |
15 | Harold Harvey | 36 | 10 | 0.03136 | 0.27778 |
41 | Lou Barr | 32 | 9 | 0.11230 | 0.28125 |
46 | Linda Nicolson | 27 | 8 | 0.10826 | 0.29630 |
16 | Linda Edmonds | 22 | 7 | 0.26260 | 0.31818 |
23 | Greg Pope | 15 | 6 | 0.17021 | 0.40000 |
25 | Steve Lindhoff | 13 | 5 | 0.36375 | 0.38462 |
28 | Srabashi Kundu | 10 | 4 | 0.08095 | 0.40000 |
33 | Steve Ignella | 5 | 3 | 0.56556 | 0.60000 |
34 | Mike Dahlberg | 4 | 2 | 0.35489 | 0.50000 |
36 | Daniel Fremgen | 2 | 1 | 0.14088 | 0.50000 |
Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set.
In summary, here's the approach used to select the sample:
- For each observation in the data set, generate a uniform random number.
- Select the first observation in the original data set for inclusion in the sample if its random number is less than or equal to the proportion of records needed (15 of 50, or 0.30).
- Modify the proportion still needed in the sample. Here, it is 14/49 if the first observation was selected for the sample; and it is 15/49 if it was not. If the random number generated for the second observation is less than or equal to this proportion, include it in the sample.
- Continue this process until you have selected exactly 15 observations.
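It may not be obvious that adjusting the proportion as we go still gives every observation the same overall chance of selection. As a quick check for the second observation only, condition on whether or not the first observation was selected:

```latex
P(\text{obs. 2 selected})
  = \frac{15}{50}\cdot\frac{14}{49} + \frac{35}{50}\cdot\frac{15}{49}
  = \frac{210 + 525}{2450}
  = \frac{735}{2450}
  = 0.30
```

That is the same 15/50 chance that the first observation had. The same kind of conditioning argument carries through to the later observations.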
Now, how to accomplish this approach using the SAS DATA step? Here's how we did it step-by-step:
- Define two variables k and n, where:
  - k = the number of observations needed to complete the sample.
  - n = the number of observations left to read from the original data set.
- Using the NOBS= option of the SET statement, determine the number of observations in the stat482.mailing data set and assign the value to a variable called total. In general, the NOBS= option creates and names a temporary variable whose value is the total number of observations in the data set specified in the SET statement.
- For the first observation, that is, when the automatic variable _N_ equals 1, set the variable n to the value of the variable total (here, 50). (Recall that automatic variables are created automatically by the DATA step, are added to the program data vector, but are not output to the data set being created. The values of automatic variables are retained from one iteration of the DATA step to the next, rather than set to missing. The automatic variable _N_ is initially set to 1. Each time the DATA step loops past the DATA statement, the variable _N_ increments by 1. The value of _N_ represents the number of times the DATA step has iterated, and often equals the number of observations in the output data set.)
- Using the RETAIN statement, initialize k to 15, the number of observations desired in the final sample.
- Use the ranuni function (starting with a seed of 860244) to generate a uniform random number between 0 and 1. Use k and n to determine the proportion of observations that still needs to be selected from the mailing data set.
- If the random number generated is less than or equal to the proportion of observations still needed, then OUTPUT the observation to the output data set. When an observation is selected, reduce the number of observations still needed in the sample by 1 (that is, k = k-1).
- At the end of each iteration of the DATA step:
  - reduce the number of observations left in the mailing data set by 1 (n = n - 1)
  - determine if the sample is complete (is k = 0?). If yes, tell SAS to STOP. In general, the STOP statement tells SAS to stop processing the current DATA step immediately and resume processing statements after the end of the current DATA step.
Note that the random = ranuni( ) and propn = k/n assignments are made here only so their values can be printed. In another situation, these values would be incorporated directly in the IF statement: if ranuni( ) le k/n then do; Additionally, k and n could be dropped from the output data set, but are kept here, so their values can be printed for educational purposes.
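A trimmed-down version along those lines might look like the following sketch. It is an illustration under stated assumptions rather than the lesson's own code: it reads a hypothetical 50-record stand-in called toy instead of stat482.mailing, and the output data set name sample2_short is made up. The random number and the proportion go straight into the IF statement, and k and n are dropped from the output.

```sas
/* Hypothetical stand-in for the mailing list: 50 records numbered 1 to 50 */
DATA toy;
  do Num = 1 to 50;
    output;
  end;
RUN;

/* Exact-sized sample of 15 without the helper variables in the output */
DATA sample2_short (drop = k n);
  set toy nobs = total;
  retain k 15 n;                /* k = still needed, n = left to read */
  if _N_ = 1 then n = total;
  if ranuni(860244) le k/n then
    do;
      output;                   /* select this record */
      k = k - 1;
    end;
  n = n - 1;
  if k = 0 then stop;           /* sample is complete */
RUN;
```

The sample always reaches exactly 15 records: once k equals n, the ratio k/n is 1 and every remaining record is selected.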
Example 34.5
The following code illustrates an alternative way of using a DATA step to randomly select an exact-sized random sample without replacement. The code is less efficient, because it requires that the data set be processed twice and sorted once, but it may feel more natural and intuitive to you:
DATA sample3A;
set stat482.mailing;
random=ranuni(860244);
RUN;
PROC SORT data=sample3A;
by random;
RUN;
DATA sample3A;
set sample3A;
if _N_ le 15;
RUN;
PROC PRINT data=sample3A;
title1 'Sample3A: Exact-Sized Simple Random Sample';
title2 'without Replacement';
RUN;
Obs | Num | Name | Street | City | State | random |
---|---|---|---|---|---|---|
1 | 9 | Ernest Smith | 492 Main Street | Bellefonte | PA | 0.01228 |
2 | 6 | Delilah Fequa | 2094 Acorn Street | Bellefonte | PA | 0.02446 |
3 | 15 | Harold Harvey | 480 Main Street | Bellefonte | PA | 0.03136 |
4 | 28 | Srabashi Kundu | 112 E. Beaver Avenue | State College | PA | 0.08095 |
5 | 5 | Lisa Brothers | 89 Elm Street | Bellefonte | PA | 0.08799 |
6 | 46 | Linda Nicolson | 71 Liberty Terrace | Port Matilda | PA | 0.10826 |
7 | 41 | Lou Barr | 219 Eagle Street | Port Matilda | PA | 0.11230 |
8 | 4 | Mark Adams | 312 Oak Lane | Bellefonte | PA | 0.12829 |
9 | 14 | William Edwards | 79 Oak Lane | Bellefonte | PA | 0.12908 |
10 | 36 | Daniel Fremgen | 103 W. College Avenue | State College | PA | 0.14088 |
11 | 23 | Greg Pope | 5100 No. Atherton | State College | PA | 0.17021 |
12 | 16 | Linda Edmonds | 410 College Avenue | State College | PA | 0.26260 |
13 | 38 | Miriam Denders | 2348 Robin Avenue | Port Matilda | PA | 0.32450 |
14 | 13 | James Whitney | 104 Pine Hill Drive | Bellefonte | PA | 0.33555 |
15 | 34 | Mike Dahlberg | 1201 No. Atherton | State College | PA | 0.35489 |
Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set.
The approach used is very similar to the approach used previously for selecting an approximate-sized sample without replacement. That is:
- For each observation in the data set, use the ranuni function to generate a uniform random number and store it in the variable called random.
- Sort the data set by the random number random.
- Select the first 15 observations from the sorted data set using the automatic variable _N_ (if _N_ le 15).
By so doing, every observation in the mailing data set has an equal likelihood of being one of the first 15 observations, and therefore an equal likelihood of being selected into the sample.
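As a small variation on the last step, the OBS= data set option can stop the read after 15 observations instead of the subsetting IF on _N_. The sketch below is illustrative only and not part of the original lesson: toy is a hypothetical 50-record stand-in for stat482.mailing, and the names shuffled and sample3_obs are made up.

```sas
/* Hypothetical stand-in for the mailing list: 50 records numbered 1 to 50 */
DATA toy;
  do Num = 1 to 50;
    output;
  end;
RUN;

/* Attach a uniform random number to every record */
DATA shuffled;
  set toy;
  random = ranuni(860244);
RUN;

/* Put the records in random order */
PROC SORT data = shuffled;
  by random;
RUN;

/* Read only the first 15 records of the shuffled data set */
DATA sample3_obs;
  set shuffled (obs = 15);
RUN;
```

The result is the same as with if _N_ le 15, but SAS stops reading after the fifteenth record rather than passing over the whole data set.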
Example 34.6
The following code illustrates yet another alternative way of randomly selecting an exact-sized random sample without replacement. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 15 of the 50 observations from the permanent SAS data set mailing:
PROC SURVEYSELECT data = stat482.mailing
out = sample3B
method = SRS
seed = 12345678
sampsize = 15;
title;
RUN;
PROC PRINT data = sample3B;
title1 'Sample3B: Exact-Sized Simple Random Sample';
title2 'without Replacement (using PROC SURVEYSELECT)';
RUN;
Obs | Num | Name | Street | City | State |
---|---|---|---|---|---|
1 | 1 | Jonathon Smothers | 103 Oak Lane | Bellefonte | PA |
2 | 5 | Lisa Brothers | 89 Elm Street | Bellefonte | PA |
3 | 12 | Fran Cipolla | 912 Cardinal Drive | Bellefonte | PA |
4 | 14 | William Edwards | 79 Oak Lane | Bellefonte | PA |
5 | 38 | Miriam Denders | 2348 Robin Avenue | Port Matilda | PA |
6 | 39 | Scott Fitzgerald | 43 Blue Jay Drive | Port Matilda | PA |
7 | 40 | Jane Smiley | 298 Cardinal Drive | Port Matilda | PA |
8 | 44 | Edwin Hoch | 389 Dolphin Drive | Port Matilda | PA |
9 | 45 | Ann Draper | 72 Lake Road | Port Matilda | PA |
10 | 50 | George Matre | 75 Ashwind Drive | Port Matilda | PA |
11 | 19 | Frank Smith | 238 Waupelani Drive | State College | PA |
12 | 24 | Mark Mendel | 256 Fraser Street | State College | PA |
13 | 29 | Joe White | 678 S. Allen Street | State College | PA |
14 | 34 | Mike Dahlberg | 1201 No. Atherton | State College | PA |
15 | 35 | Doris Alcorn | 453 Fraser Street | State College | PA |
Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select a sample of 15 observations from the mailing data set. Note that the only difference between this code and the previous SURVEYSELECT code is that the sampsize = 15 option here replaces the samprate = 0.30 option there. You might want to change the seed (seed) and sample size (sampsize) a few times to see how they affect the sample.