# 34.3 - Stratified Random Sampling

In the two previous sections, we were concerned with taking a random sample from a data set without regard to whether an observation comes from a particular subgroup. When you are conducting a survey, it often behooves you to make sure that your sample contains a certain number of observations from each particular subgroup. We'll concern ourselves with such a restriction here. That is, in this section, we'll focus on ways of using SAS to obtain a **stratified random sample**, in which a subset of observations is selected randomly from each subgroup of observations as determined by the value of a stratifying variable. We'll also go back to sampling without replacement, in which once an observation is selected it cannot be selected again.

## Selecting a Stratified Sample of Equal-Sized Groups

We'll first focus on the situation in which an equal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable.

## Example 34.9

The following code illustrates how to select a **stratified random sample of equal-sized groups**. Specifically, the code tells SAS to randomly select 5 observations from each of the three subgroups — State College, Port Matilda, Bellefonte — as determined by the value of the variable *city*:

```
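/* Count the observations in each city; write the counts to bycount */
PROC FREQ data=stat482.mailing;
    table city / out=bycount noprint;
RUN;

/* Sort in place (no OUT= option) so the data can be processed by city */
PROC SORT data=stat482.mailing;
    by city;
RUN;

DATA sample5;
    merge stat482.mailing bycount (drop = percent);
    by city;
    retain k;
    if first.city then k=5;      /* k = observations still needed in this city */
    random = ranuni(109);
    propn = k/count;             /* count = observations left, including this one */
    if random le propn then
        do;
            output;
            k=k-1;
        end;
    count=count-1;
RUN;

PROC PRINT data=bycount;
    title 'Count by CITY';
RUN;

PROC PRINT data=sample5;
    title 'Sample5: Stratified Random Sample with Equal-Sized Strata';
    by city;
RUN;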
```

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the *mailing* data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, how does the program work? In order to select a stratified random sample in SAS, we basically use code similar to that for selecting equal-sized random samples without replacement, except now we process within each subgroup. More specifically, here's how the program works step-by-step:

- The sole purpose of the FREQ procedure is to determine the number of observations in the *stat482.mailing* data set that correspond to each level of the stratification variable *city* (hence, "table *city*"). The OUT= option tells SAS to create a data set called *bycount* that contains the variable *city* and two variables that contain the number (*count*) and percentage (*percent*) of records for each level of *city*.
- The SORT procedure merely sorts the *stat482.mailing* data set by *city* so that it can be processed by *city* in the next DATA step. Because the code specifies no OUT= option, the sorted result replaces *stat482.mailing* itself.
- The MERGE statement merges, by *city*, the sorted *stat482.mailing* data set with the *bycount* data set, so that the number of observations per subgroup is available. Since the percentage of observations is not needed, it is dropped from the data set on input.
- The rest of the code in the DATA step should look very familiar. That is, once the number of observations per subgroup in the original *stat482.mailing* data set is available, you can randomly select records from the subgroup as you would select equal-sized random samples without replacement, except you select within *city* (hence, "by *city*"). Every time SAS reads in a new *city* (hence, "if *first.city*"), the number of observations still needed in the subgroup's sample (*k*) is set to the number of observations desired in each of the subgroups (5, here). After each observation is processed, *count* is reduced by 1, so *k*/*count* is always the number of observations still needed divided by the number still available in that *city*.

Note that, again, the *random* = **ranuni**( ) and *propn* = *k*/*count* assignments are made here only so their values can be printed. In another situation, these values would be incorporated directly in the IF statement: if **ranuni**( ) le *k*/*count* then do; Additionally, *k* and *count* could be dropped from the output data set, but are kept here so their values can be printed for educational purposes.
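
As a rough sketch of that more compact form, the DATA step might look like the following. It assumes that *bycount* has already been created and that *stat482.mailing* has already been sorted by *city*, as in the program above. The data set name *sample5b* is made up for this illustration. Because the seed and the order of the **ranuni** calls are unchanged, it should select the same observations as *sample5*.

```sas
/* Sketch only: sample5b is a hypothetical name, not part of the lesson. */
/* Assumes bycount exists and stat482.mailing is sorted by city.         */
DATA sample5b (drop = k count);
    merge stat482.mailing bycount (drop = percent);
    by city;
    retain k;
    if first.city then k = 5;
    /* random number and proportion used directly, not stored */
    if ranuni(109) le k/count then
        do;
            output;
            k = k - 1;
        end;
    count = count - 1;
RUN;
```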

## Example 34.10

The following code illustrates an alternative way of randomly selecting a **stratified random sample of equal-sized groups**. The code, while less efficient — because it requires that the data set be processed twice and sorted once — may feel more natural and intuitive to you:

```
/* Split the mailing data into one data set per city */
DATA scollege pmatilda bellefnt;
    set stat482.mailing;
    if city = 'State College' then output scollege;
    else if city = 'Port Matilda' then output pmatilda;
    else if city = 'Bellefonte' then output bellefnt;
RUN;

/* Keep num randomly chosen observations from the data set dsn */
%MACRO select (dsn, num);
    DATA &dsn;
        set &dsn;
        random=ranuni(85329);
    RUN;

    PROC SORT data=&dsn;
        by random;
    RUN;

    DATA &dsn;
        set &dsn;
        if _N_ le &num;
    RUN;
%MEND select;

%SELECT(scollege, 5); %SELECT(pmatilda, 5); %SELECT(bellefnt, 5);

DATA sample6A;
    set bellefnt pmatilda scollege;
RUN;

PROC PRINT data=sample6A;
    title 'Sample6A: Stratified Random Sample with Equal-Sized Strata';
    by city;
RUN;
```

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the *mailing* data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, how does the program work? In summary, here's how this approach works:

- The first DATA step uses an IF-THEN-ELSE statement in conjunction with OUTPUT statements to divide the original *mailing* data set into several data sets based on the value of *city*. (Here, we create three data sets, one for each *city*, namely *scollege*, *pmatilda*, and *bellefnt*.)
- Then, the macro *select* exactly mimics the creation of the *sample3A* data set in Example 10.5 on the Random Sampling Without Replacement page. That is, the macro generates a random number for each observation, the data set is sorted by the random number, and then the first *num* observations are selected.
- Then, the program calls the macro *select* three times, once for each of the *city* data sets (*scollege*, *pmatilda*, and *bellefnt*), selecting five observations from each.
- Finally, the last DATA step concatenates the three data sets, *bellefnt*, *pmatilda*, and *scollege*, with 5 observations each, back into one data set called *sample6A* containing the 15 randomly selected observations.
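
If counting rows in the printed output feels tedious, one possible check (a sketch, assuming *sample6A* was created by the program above) is a one-way frequency table of *city*, which should report a frequency of 5 for each of the three cities:

```sas
/* Sketch: tabulate how many sampled observations fall in each city. */
PROC FREQ data = sample6A;
    table city / nocum nopercent;   /* show only the frequency counts */
    title 'Sample6A: Number of Sampled Observations by CITY';
RUN;
```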

Lo and behold, when all is said and done, we have another stratified random sample of equal-sized groups! Approach #2 checked off. Now, onto one last approach!

## Example 34.11

The following code illustrates yet another alternative way of randomly selecting a **stratified random sample of equal-sized groups**. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly five observations from each of the three *city* subgroups in the permanent SAS data set *mailing*:

```
PROC SURVEYSELECT data = stat482.mailing
                  out = sample6B
                  method = SRS
                  seed = 12345678
                  sampsize = (5 5 5);
    strata city notsorted;
    title;
RUN;

PROC PRINT data = sample6B;
    title1 'Sample6B: Stratified Random Sample';
    title2 'with Equal-Sized Strata (using PROC SURVEYSELECT)';
RUN;
```

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the *mailing* data set, five observations from Bellefonte, five observations from Port Matilda, and five observations from State College.

Now, the specifics about the code. The only things that should look new here are the STRATA statement and the form of the SAMPSIZE= option. The STRATA statement tells SAS to partition the input data set *stat482.mailing* into nonoverlapping groups defined by the variable *city*. The NOTSORTED option does not tell SAS that the data set is unsorted. Instead, the NOTSORTED option tells SAS that the observations in the data set are arranged in *city* groups, but the groups are not necessarily in alphabetical order. The SAMPSIZE= option, with its list of three values, tells SAS that we are interested in sampling five observations from each of the three *city* groups.
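
When every stratum gets the same sample size, the list of values can, as far as I know, be replaced by a single number, which PROC SURVEYSELECT then applies to each stratum. The sketch below assumes that behavior. The output data set name *sample6C* is made up for this illustration.

```sas
/* Sketch: a single SAMPSIZE= value is applied to every stratum. */
/* sample6C is a hypothetical name, not part of the lesson.      */
PROC SURVEYSELECT data = stat482.mailing
                  out = sample6C
                  method = SRS
                  seed = 12345678
                  sampsize = 5;
    strata city notsorted;
RUN;
```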

## Selecting a Stratified Sample of Unequal-Sized Groups

Now, we'll focus on the situation in which an unequal number of observations are selected from each subgroup of observations as determined by the value of a stratifying variable. If there are an unequal number of observations for each subgroup in the original data set, this sampling scheme may be accomplished by selecting the same proportion of observations from each subgroup. Again, we'll sample without replacement, in which once an observation is selected it cannot be re-selected.

To select a **stratified random sample of unequal-sized groups**, we could use the code from Example 34.10 by passing the different group sample sizes into the macro *select*. Alternatively, we could create a data set containing two count variables: one that contains the number of observations in each subgroup (*n*), and another that contains the number of observations that need to be selected from each subgroup (*k*). Once the data set is created, we could merge it with the original data set, and select observations randomly as we have done previously for random samples without replacement. That's the strategy that the following example uses.
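
Before turning to that strategy, here is a rough sketch of the first option. It assumes the macro *select* from Example 34.10 is still defined in your SAS session. The sample sizes 5, 6, and 8 and the data set name *sample7A* are chosen only for illustration.

```sas
/* Re-create the per-city data sets, because the %SELECT calls in   */
/* Example 34.10 overwrote them with their 5-observation samples.   */
DATA scollege pmatilda bellefnt;
    set stat482.mailing;
    if city = 'State College' then output scollege;
    else if city = 'Port Matilda' then output pmatilda;
    else if city = 'Bellefonte' then output bellefnt;
RUN;

/* Same macro as before, but with a different num for each city */
%SELECT(bellefnt, 5); %SELECT(pmatilda, 6); %SELECT(scollege, 8);

/* sample7A is a hypothetical name used only in this sketch */
DATA sample7A;
    set bellefnt pmatilda scollege;
RUN;
```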

## Example 34.12

The following code illustrates how to select a **stratified random sample of unequal-sized groups**. Specifically, the code tells SAS to randomly select 5, 6, and 8 observations, respectively, from each of the three subgroups — Bellefonte, Port Matilda, and State College — as determined by the value of the variable *city*:

```
/* Build nselect: one row per city holding n (count) and k (sample size). */
/* Assumes stat482.mailing is already sorted by city.                     */
DATA nselect;
    set stat482.mailing (keep = city);
    by city;
    n+1;
    if last.city;
    input k;
    output;
    n=0;
    DATALINES;
5
6
8
;
RUN;

DATA sample7 (drop = k n);
    merge stat482.mailing nselect;
    by city;
    if ranuni(7841) le k/n then
        do;
            output;
            k=k-1;
        end;
    n=n-1;
RUN;

PROC PRINT data=nselect;
    title 'NSELECT: Count by CITY';
RUN;

PROC PRINT data=sample7;
    title 'Sample7: Stratified Random Sample of Unequal-Sized Groups';
    by city;
RUN;
```

First, launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the *mailing* data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.

Now, how does the program work? The key to understanding the program is to understand the first DATA step. The remainder of the program is much like code we've seen before, like that in, say, Example 10.4, in which a random sample is selected without replacement. Now, the first DATA step creates a temporary data set called *nselect* that contains three variables, *city*, *n*, and *k*:

- To count the number of observations *n* from each *city*, we use a counter variable *n* in conjunction with the *last.city* variable. Because *n* appears in the sum statement n+1, SAS sets *n* to 0 before the first iteration of the DATA step and retains its value, so *n* increases by 1 on each iteration until SAS has counted all of the observations for one of the levels of *city*.
- To tell SAS the number of observations to select from each *city*, we use an INPUT statement in conjunction with a DATALINES statement. The numbers *k* are listed in the order of *city*, so here we tell SAS we want to randomly select 5 observations from Bellefonte, 6 observations from Port Matilda, and 8 observations from State College.
- To write the numbers *n* and *k* to the new data set *nselect*, we use the *last.city* variable in a subsetting IF statement. So here, when SAS finds the last record within a *city* subgroup, *n* and *k* are written to the *nselect* data set, and *n* is reset to 0 in preparation for counting the number of observations for the next *city* in the data set.

The second DATA step creates a temporary data set called *sample7* by merging the *stat482.mailing* data set with the *nselect* data set. After merging, the code then randomly selects the desired number of observations from each *city* just as we did previously for random samples without replacement.

## Example 34.13

The following code illustrates an alternative way of randomly selecting a **stratified random sample of unequal-sized groups**. In selecting such a sample, rather than specifying the desired number sampled from each group, we could tell SAS to select an equal proportion of observations from each group. The following code does just that. Specifically, the code tells SAS to randomly select 25% of the observations from each of the three subgroups — Bellefonte, Port Matilda, and State College:

```
/* Build nselect2: one row per city holding n (count) and k = 25% of n, rounded up */
DATA nselect2;
    set stat482.mailing (keep=city);
    by city;
    n+1;
    if last.city;
    k=ceil(0.25*n);
    output;
    n=0;
RUN;

DATA sample8 (drop = k n);
    merge stat482.mailing nselect2;
    by city;
    if ranuni(7841) le k/n then
        do;
            output;
            k=k-1;
        end;
    n=n-1;
RUN;

PROC PRINT data=nselect2;
    title 'NSELECT2: Count by CITY';
RUN;

PROC PRINT data=sample8;
    title 'Sample8: Stratified Random Sample of Unequal-Sized Groups';
RUN;
```

In this case, it probably makes most sense to first compare the code here with the code from the previous example. The only difference you should see is that rather than using an INPUT and DATALINES statement to read in the number of observations, *k*, to be selected from each of the subgroups, here we use the **ceiling function**, *ceil*( ), to determine *k*. Specifically, *k* is calculated using:

k=ceil(0.25*n);

Now, if you think about it, if I tell you to select 25% of the *n* = 16 observations in a subgroup, you'd tell me that we need to select 4 observations. But what if I tell you to select 25% of the *n* = 15 observations in a subgroup? If you calculate 25% of 15, you get 3.75. Hmmm.... how can you select 3.75 observations? That's where the ceiling function comes into play. The ceiling function, **ceil**(argument), returns the smallest integer that is greater than or equal to the argument. So, for example, **ceil**(3.75) equals 4 ... as do, of course, ceil(3.1), ceil(3.25), and ceil(3.87) ... you get the idea.
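
If you'd like to see the ceiling function in action, a small throwaway DATA step such as the following sketch writes *n* and the resulting *k* to the SAS log. The subgroup sizes 15, 16, and 17 are made up for illustration and are not taken from the *mailing* data set.

```sas
/* Sketch: show how ceil() turns 25% of n into a whole number k. */
DATA _null_;
    do n = 15, 16, 17;          /* illustrative subgroup sizes */
        k = ceil(0.25*n);
        put n= k=;              /* writes, e.g., n=15 k=4 to the log */
    end;
RUN;
```

The log should show *k* = 4 for *n* = 15 and *n* = 16, and *k* = 5 for *n* = 17, since 25% of 17 is 4.25.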

That's it ... that's all there is to it. Once *k* is determined using the ceiling function, the rest of the program is identical to the program in the previous example.

Now, try it out... launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the *mailing* data set, 25% of the observations from Bellefonte, Port Matilda, and State College. In this case, that translates to 4, 4, and 6 observations, respectively.
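
For comparison, the SURVEYSELECT procedure has a SAMPRATE= option that requests a sampling rate rather than a sample size. The sketch below assumes that a single rate is applied to every stratum and that the procedure rounds the resulting stratum sample sizes up, much like the ceiling function. Treat both points as assumptions to confirm in the SAS documentation. The data set name *sample8B* is made up for this illustration.

```sas
/* Sketch: request 25% of the observations in each city.      */
/* sample8B is a hypothetical name, not part of the lesson.   */
/* Rounding behavior is assumed; check the SAS documentation. */
PROC SURVEYSELECT data = stat482.mailing
                  out = sample8B
                  method = SRS
                  seed = 12345678
                  samprate = 0.25;
    strata city notsorted;
RUN;
```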

## Example 34.14

The following code illustrates yet another alternative way... you've gotta be kidding me! ...of randomly selecting a **stratified random sample of unequal-sized groups**. Specifically, the program uses the SURVEYSELECT procedure to tell SAS to randomly sample exactly 5, 6, and 8 observations, respectively, from each of the three *city* subgroups in the permanent SAS data set *stat482.mailing*:

```
PROC SURVEYSELECT data = stat482.mailing
                  out = sample9
                  method = SRS
                  seed = 12345678
                  sampsize = (5 6 8);
    strata city notsorted;
    title;
RUN;

PROC PRINT data = sample9;
    title1 'Sample9: Stratified Random Sample';
    title2 'with Unequal-Sized Strata (using PROC SURVEYSELECT)';
RUN;
```

Straightforward enough! The only difference between this code and the code in Example 34.11 is that here the sample sizes are specified as 5, 6, and 8 rather than 5, 5, and 5. Note that you must list the stratum sample size values in the order in which the strata appear in the input data set. Like I said, straightforward enough.

Launch and run the SAS program. Then, review the resulting output to convince yourself that the code did indeed select, from the *mailing* data set, five observations from Bellefonte, six observations from Port Matilda, and eight observations from State College.