16.5 - Selecting Variables

It is often the case that input data sets contain many more variables than you want in your final merged data set. Just as you can use the DROP= and KEEP= DATA step options to drop or keep a subset of variables when reading SAS data sets, you can use the DROP= and KEEP= options to drop or keep a subset of variables when merging SAS data sets. Let's take a look at an example!

Example 16.10 Section

The following program prints a subset of the observations in the ICDB Study's background (back), pain and urgency (purg), and family history (fhx) data sets:

LIBNAME icdb 'C:\yourdrivename\Stat481WC\04combineII\sasndata';
PROC PRINT data = icdb.back (OBS=5) NOOBS;
        title 'The back data set';
RUN;       
PROC PRINT data = icdb.purg (OBS=5) NOOBS;
        title 'The purg data set';
                    where v_type = 0;
RUN;        
PROC PRINT data = icdb.fhx (OBS=5) NOOBS;
        title 'The fhx data set';
RUN;

The back data set

subj

v_type

v_date

r_id

b_date

sex

state

country

race

ethnic

relig

mar_st

ed_level

emp_st

job_chng

income

110027

0

10/05/93

2068

07/05/62

2

5

1

4

0

0

1

3

1

.

2

110029

0

10/05/93

2068

09/07/26

2

5

1

4

0

2

1

5

8

.

2

110039

0

12/07/93

2068

07/24/24

2

22

1

4

0

3

1

3

8

.

2

110040

0

11/30/93

2068

10/20/67

2

32

1

4

0

7

1

5

1

.

2

110045

0

01/11/94

2068

04/18/25

1

36

1

4

0

3

1

1

8

0

2


The purg data set

subj

v_type

v_date

r_id

purg_1

purg_2

110027

0

10/05/93

2068

8

8

110029

0

10/05/93

2068

8

8

110039

0

12/07/93

2068

2

3

110040

0

11/30/93

2068

3

5

110045

0

01/11/94

2068

2

2


The fhx data set

subj

v_type

v_date

i_id

fhx_1

fhx_2

fhx_3

fhx_4

fhx_5

fhx_6

fhx_7

fhx_8

fhx_9

fhx_10

fhx_11

fhx_12

fhx_13

fhx_14

110027

0

10/05/93

2068

0

0

0

.

0

1

1

1

0

0

.

0

1

0

110029

0

10/05/93

2068

0

0

0

.

1

1

8

0

8

0

.

0

0

8

110039

0

12/07/93

2068

1

0

0

.

0

0

0

0

0

0

.

0

0

1

110040

0

11/30/93

2068

0

0

8

.

8

8

1

1

8

0

.

1

8

1

110045

0

01/11/94

2068

0

0

1

0

0

0

1

1

0

1

0

0

0

1

In each case, the PRINT procedure's OBS=5 option tells SAS to print just the first five observations of each data set. The second PRINT procedure contains an extra direction to SAS to print only those observations for which the v_type variable equals 0. Before you can run the program, you'll need to right-click the data set links... back, purg, and fhx... so that you can save the data sets to a convenient location on your computer. After doing so, launch the SAS program, and edit the LIBNAME statement so that it reflects the location in which you saved the data sets. Then, run  the program, and review the output so that you can familiarize yourself with each data set.

As you can see, the data sets share three variables — subj, v_type, and v_date, and then each data set has its own set of unique variables. Our interest in this example is to merge the three data sets into a new data set called icdbdata that contains the common subj variable, the b_date and sex variables from the back data set, the purg_1 and purg_2 variables from the purg data set, and the fhx_1 and fhx_2 variables from the fhx data set. That is, when all is said and done, this is what the first ten observations of the icdbdata set should look like:

subj

b_date

sex

purg_1

purg_2

fhx_1

fhx_2

110027

07/05/62

2

8

8

0

0

110029

09/07/26

2

8

8

0

0

110039

07/24/24

2

2

3

1

0

110040

10/20/67

2

3

5

0

0

110045

04/18/25

1

2

2

0

0

110049

10/05/23

2

3

3

1

0

110051

12/02/23

2

6

7

0

8

110052

1202/42

2

6

6

0

0

110053

03/15/22

2

7

7

0

0

110055

03/31/41

2

6

6

0

0

The following code does the trick for us:

LIBNAME icdb 'C:\yourdrivename\Stat481WC\04combineII\sasndata';
DATA icdbdata (drop = v_type);
   merge icdb.back (keep = subj b_date sex)
           icdb.purg (where = (v_type = 0)
                      keep = subj v_type purg_1 purg_2)
           icdb.fhx  (keep = subj fhx_1 fhx_2);
   by subj;
RUN;
PROC PRINT data = icdbdata (OBS = 10) NOOBS;
        title 'The icdbdata data set';
RUN;

The icdbdata data set

subj

b_date

sex

purg_1

purg_2

fhx_1

fhx_2

110027

07/05/62

2

8

8

0

0

110029

09/07/26

2

8

8

0

0

110039

07/24/24

2

2

3

1

0

110040

10/20/67

2

3

5

0

0

110045

04/18/25

1

2

2

0

0

110049

10/05/23

2

3

3

1

0

110051

12/02/42

2

6

7

0

8

110052

01/04/25

2

6

6

0

0

110053

03/15/22

2

7

7

0

0

110055

03/31/41

2

6

6

0

0

At a global level, the DATA step tells SAS to merge the (permanent) back, purg, and fhx data sets by the subj variable. The data set options give SAS more specific directions about what to do while match-merging the data sets. Specifically:

  • The KEEP= option attached to the icdb.back data set tells SAS to keep just three variables — subj, b_date, and sex — from the icdb.back data set.
  • The WHERE= option attached to the icdb.purg data set tells SAS to select the observations for which the v_type variable equals 0.
  • The KEEP= option attached to the icdb.purg data set tells SAS to keep four variables — subj, v_type, purg_1, and purg_2 — from the icdb.purg data set. Although we don't want the v_type variable in the final icdbdata data set, we still need to tell SAS to keep the variable just so we can use it to select the observations for which v_type equals 0.
  • The KEEP= option attached to the icdb.fhx data set tells SAS to keep just three variables — subj, fhx_1, and fhx_2 — from the icdb.purg data set.
  • Finally, the DROP= option attached to the icdbdata output data set tells SAS to drop the v_type variable from the output data set.

Launch and run  the SAS program, and review the output from the PRINT procedure to convince yourself that the icdbdata data set contains only the variables and observations that we desired.