It is often the case that input data sets contain many more variables than you want in your final merged data set. Just as you can use the DROP= and KEEP= DATA step options to drop or keep a subset of variables when reading SAS data sets, you can use the DROP= and KEEP= options to drop or keep a subset of variables when merging SAS data sets. Let's take a look at an example!
Example 16.10 Section
The following program prints a subset of the observations in the ICDB Study's background (back), pain and urgency (purg), and family history (fhx) data sets:
LIBNAME icdb 'C:\yourdrivename\Stat481WC\04combineII\sasndata';
PROC PRINT data = icdb.back (OBS=5) NOOBS;
title 'The back data set';
RUN;
PROC PRINT data = icdb.purg (OBS=5) NOOBS;
title 'The purg data set';
where v_type = 0;
RUN;
PROC PRINT data = icdb.fhx (OBS=5) NOOBS;
title 'The fhx data set';
RUN;
subj | v_type | v_date | r_id | b_date | sex | state | country | race | ethnic | relig | mar_st | ed_level | emp_st | job_chng | income |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110027 | 0 | 10/05/93 | 2068 | 07/05/62 | 2 | 5 | 1 | 4 | 0 | 0 | 1 | 3 | 1 | . | 2 |
110029 | 0 | 10/05/93 | 2068 | 09/07/26 | 2 | 5 | 1 | 4 | 0 | 2 | 1 | 5 | 8 | . | 2 |
110039 | 0 | 12/07/93 | 2068 | 07/24/24 | 2 | 22 | 1 | 4 | 0 | 3 | 1 | 3 | 8 | . | 2 |
110040 | 0 | 11/30/93 | 2068 | 10/20/67 | 2 | 32 | 1 | 4 | 0 | 7 | 1 | 5 | 1 | . | 2 |
110045 | 0 | 01/11/94 | 2068 | 04/18/25 | 1 | 36 | 1 | 4 | 0 | 3 | 1 | 1 | 8 | 0 | 2 |
subj | v_type | v_date | r_id | purg_1 | purg_2 |
---|---|---|---|---|---|
110027 | 0 | 10/05/93 | 2068 | 8 | 8 |
110029 | 0 | 10/05/93 | 2068 | 8 | 8 |
110039 | 0 | 12/07/93 | 2068 | 2 | 3 |
110040 | 0 | 11/30/93 | 2068 | 3 | 5 |
110045 | 0 | 01/11/94 | 2068 | 2 | 2 |
subj | v_type | v_date | i_id | fhx_1 | fhx_2 | fhx_3 | fhx_4 | fhx_5 | fhx_6 | fhx_7 | fhx_8 | fhx_9 | fhx_10 | fhx_11 | fhx_12 | fhx_13 | fhx_14 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110027 | 0 | 10/05/93 | 2068 | 0 | 0 | 0 | . | 0 | 1 | 1 | 1 | 0 | 0 | . | 0 | 1 | 0 |
110029 | 0 | 10/05/93 | 2068 | 0 | 0 | 0 | . | 1 | 1 | 8 | 0 | 8 | 0 | . | 0 | 0 | 8 |
110039 | 0 | 12/07/93 | 2068 | 1 | 0 | 0 | . | 0 | 0 | 0 | 0 | 0 | 0 | . | 0 | 0 | 1 |
110040 | 0 | 11/30/93 | 2068 | 0 | 0 | 8 | . | 8 | 8 | 1 | 1 | 8 | 0 | . | 1 | 8 | 1 |
110045 | 0 | 01/11/94 | 2068 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 |
In each case, the PRINT procedure's OBS=5 option tells SAS to print just the first five observations of each data set. The second PRINT procedure contains an extra direction to SAS to print only those observations for which the v_type variable equals 0. Before you can run the program, you'll need to right-click the data set links... back, purg, and fhx... so that you can save the data sets to a convenient location on your computer. After doing so, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data sets. Then, run the program, and review the output so that you can familiarize yourself with each data set.
As you can see, the data sets share three variables — subj, v_type, and v_date, and then each data set has its own set of unique variables. Our interest in this example is to merge the three data sets into a new data set called icdbdata that contains the common subj variable, the b_date and sex variables from the back data set, the purg_1 and purg_2 variables from the purg data set, and the fhx_1 and fhx_2 variables from the fhx data set. That is, when all is said and done, this is what the first ten observations of the icdbdata set should look like:
subj | b_date | sex | purg_1 | purg_2 | fhx_1 | fhx_2 |
---|---|---|---|---|---|---|
110027 | 07/05/62 | 2 | 8 | 8 | 0 | 0 |
110029 | 09/07/26 | 2 | 8 | 8 | 0 | 0 |
110039 | 07/24/24 | 2 | 2 | 3 | 1 | 0 |
110040 | 10/20/67 | 2 | 3 | 5 | 0 | 0 |
110045 | 04/18/25 | 1 | 2 | 2 | 0 | 0 |
110049 | 10/05/23 | 2 | 3 | 3 | 1 | 0 |
110051 | 12/02/23 | 2 | 6 | 7 | 0 | 8 |
110052 | 1202/42 | 2 | 6 | 6 | 0 | 0 |
110053 | 03/15/22 | 2 | 7 | 7 | 0 | 0 |
110055 | 03/31/41 | 2 | 6 | 6 | 0 | 0 |
The following code does the trick for us:
LIBNAME icdb 'C:\yourdrivename\Stat481WC\04combineII\sasndata';
DATA icdbdata (drop = v_type);
merge icdb.back (keep = subj b_date sex)
icdb.purg (where = (v_type = 0)
keep = subj v_type purg_1 purg_2)
icdb.fhx (keep = subj fhx_1 fhx_2);
by subj;
RUN;
PROC PRINT data = icdbdata (OBS = 10) NOOBS;
title 'The icdbdata data set';
RUN;
subj | b_date | sex | purg_1 | purg_2 | fhx_1 | fhx_2 |
---|---|---|---|---|---|---|
110027 | 07/05/62 | 2 | 8 | 8 | 0 | 0 |
110029 | 09/07/26 | 2 | 8 | 8 | 0 | 0 |
110039 | 07/24/24 | 2 | 2 | 3 | 1 | 0 |
110040 | 10/20/67 | 2 | 3 | 5 | 0 | 0 |
110045 | 04/18/25 | 1 | 2 | 2 | 0 | 0 |
110049 | 10/05/23 | 2 | 3 | 3 | 1 | 0 |
110051 | 12/02/42 | 2 | 6 | 7 | 0 | 8 |
110052 | 01/04/25 | 2 | 6 | 6 | 0 | 0 |
110053 | 03/15/22 | 2 | 7 | 7 | 0 | 0 |
110055 | 03/31/41 | 2 | 6 | 6 | 0 | 0 |
At a global level, the DATA step tells SAS to merge the (permanent) back, purg, and fhx data sets by the subj variable. The data set options give SAS more specific directions about what to do while match-merging the data sets. Specifically:
- The KEEP= option attached to the icdb.back data set tells SAS to keep just three variables — subj, b_date, and sex — from the icdb.back data set.
- The WHERE= option attached to the icdb.purg data set tells SAS to select the observations for which the v_type variable equals 0.
- The KEEP= option attached to the icdb.purg data set tells SAS to keep four variables — subj, v_type, purg_1, and purg_2 — from the icdb.purg data set. Although we don't want the v_type variable in the final icdbdata data set, we still need to tell SAS to keep the variable just so we can use it to select the observations for which v_type equals 0.
- The KEEP= option attached to the icdb.fhx data set tells SAS to keep just three variables — subj, fhx_1, and fhx_2 — from the icdb.purg data set.
- Finally, the DROP= option attached to the icdbdata output data set tells SAS to drop the v_type variable from the output data set.
Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the icdbdata data set contains only the variables and observations that we desired.