16.4 - Excluding Unmatched Observations

By default, when match-merging, the DATA step combines all of the observations in all of the input data sets. There may be situations, however, in which we want SAS to select only those observations for which a match exists in all of the input data sets. This is a rather common thing to do when, for instance, one of the data sets contains a master list of patients, say, and the other data sets contain information on actions taken by the patients. When we merge the data sets, we'd only want to include those patients who appear both in the master list data set and in the action data set. For example, suppose we have a data set containing a master list of patients:

id	v_date
110011	01/01/06
110012	01/02/06
110013	01/04/06

and a data set called allvoids containing information about some patients' urinary voiding habits (void_no and volume) for a given visit date (v_date):

id	v_date	void_no	volume
110011	01/01/06	1	250
110011	01/01/06	2	300
110011	01/01/06	3	302
110011	01/01/06	4	231
110012	01/02/06	1	305
110012	01/02/06	2	225
110012	01/02/06	3	400
110013	01/04/06	1	300
110013	01/04/06	2	333
110013	01/04/06	3	401
110013	01/04/06	4	404
110014	01/06/06	1	398
110014	01/06/06	2	413

Note that the patients data set contains three id numbers (110011, 110012, and 110013), whereas the allvoids data set contains information on four id numbers (110011, 110012, 110013, and 110014). Because the patients data set is viewed as the master list of patients to include in any statistical analysis, we would want to merge the patients and allvoids data set in such a way to ensure that only those patients whose id appears in both input data sets are included in the output analysis data set:

id	v_date	void_no	volume
110011	01/01/06	1	250
110011	01/01/06	2	300
110011	01/01/06	3	302
110011	01/01/06	4	231
110012	01/02/06	1	305
110012	01/02/06	2	225
110012	01/02/06	3	400
110013	01/04/06	1	300
110013	01/04/06	2	333
110013	01/04/06	3	401
110013	01/04/06	4	404

Fortunately, telling SAS to exclude unmatched observations is as simple as using the IN= data set option along with a subsetting IF statement. Specifically:

We use the IN= DATA step option to create a variable that indicates whether the data set contributed to the current observation.
Then, we use a subsetting IF statement to check the IN= values and to write to the output (merged) data set only those observations that appear in the data sets for which IN= specified.

Let's try this approach out on our patients and allvoids data sets!

Example 16.9

The following program simply reads in the patients and allvoids data sets:

DATA patients;
    input id v_date : mmddyy8.;
    format v_date mmddyy8.;
    DATALINES;
    110011 01/01/06
    110012 01/02/06
    110013 01/04/06;
RUN;
     
DATA allvoids;
    input id v_date : mmddyy8. void_no volume;
    format v_date mmddyy8.;
    DATALINES;
    110011  01/01/06 1 250
    110011  01/01/06 2 300
    110011  01/01/06 3 302
    110011  01/01/06 4 231
    110012  01/02/06 1 305
    110012  01/02/06 2 225
    110012  01/02/06 3 400
    110013  01/04/06 1 300
    110013  01/04/06 2 333
    110013  01/04/06 3 401
    110013  01/04/06 4 404
    110014  01/06/06 1 398
    110014  01/06/06 2 413;
RUN;

Launch and run the program, so that we can work with the data sets. Note that the two data sets are linked by the variables id and v_date. The patients data set contains one observation for each of the three patients. And, the allvoids data set contains multiple observations for each of the four patients. Incidentally, in case you are curious, each observation in the allvoids data set corresponds to a patient's reported void of urine. For example, subject 110011 reported four voids on 01/01/06, subject 110012 reported three voids on 01/02/06, and so on.

Now, if we wanted to analyze the voiding data, we'd have to make sure that we didn't include any data from patients not included in the patients data set. That is, we'd want to exclude the voiding data corresponding to subject 110014. If we merge the patients and allvoids data sets by the id and v_date variables, we get an analysis data set that has one observation for each patient's reported void:

DATA analysis;
    merge patients allvoids;
    by id v_date;
RUN;
      
PROC PRINT data=analysis NOOBS; 
    title 'The analysis data set'; 
RUN;

The analysis data set
id	v_date	void_no	volume
110011	01/01/06	1	250
110011	01/01/06	2	300
110011	01/01/06	3	302
110011	01/01/06	4	231
110012	01/02/06	1	305
110012	01/02/06	2	225
110012	01/02/06	3	400
110013	01/04/06	1	300
110013	01/04/06	2	333
110013	01/04/06	3	401
110013	01/04/06	4	404
110014	01/06/06	1	398
110014	01/06/06	2	413

Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the data sets were merged as described. You might also want to take particular note that this is the first example we've seen in which we have specified two variables in the BY statement. There is nothing that much more difficult about it. Other than making sure that both of the input data sets are sorted by both the id and v_date variables.

In reviewing the output, you should also note that we have not yet achieved what we set out to do, namely to create an analysis data set that contains only the voiding data for the patients appearing in the patients data set. That is, the analysis data set still includes the voiding data on subject 110014. The following code uses the IN= option and a subsetting IF statement to help us accomplish our task:

DATA analysis;
    merge patients (in = inpatients)
            allvoids (in = inallvoids);
    by id v_date;
    if inpatients and inallvoids;
RUN;
      
PROC PRINT data=analysis NOOBS; 
    title 'The analysis data set'; 
RUN;

The analysis data set
id	v_date	void_no	volume
110011	01/01/06	1	250
110011	01/01/06	2	300
110011	01/01/06	3	302
110011	01/01/06	4	231
110012	01/02/06	1	305
110012	01/02/06	2	225
110012	01/02/06	3	400
110013	01/04/06	1	300
110013	01/04/06	2	333
110013	01/04/06	3	401
110013	01/04/06	4	404

The IN = inpatients option tells SAS to assign a value of 1 to the inpatients variable when an observation from the patients data set contributes to the current observation. Likewise, the IN = inallvoids option tells SAS to assign a value of 1 to the inallvoids variable when an observation from the allvoids data set contributes to the current observation. The subsetting IF statement tells SAS to write only those observations to the analysis data set whose value for both inpatients and inallvoids is 1, that is, only those observations that were created from observations in both the patients and allvoids data sets. Launch and run the SAS program, and review the output from the PRINT procedure to convince yourself that the analysis data set now contains only the voiding data for the patients appearing in the patients data set.

^[1]	Link
↥	Has Tooltip/Popover
	Toggleable Visibility