17.1 - The OUTPUT Statement

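By default, SAS writes the current observation to the output data set at the end of each iteration of the DATA step. An OUTPUT statement overrides this default by telling SAS to write the current observation at the moment the OUTPUT statement is processed, not at the end of the DATA step. The OUTPUT statement takes the form: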

 OUTPUT dataset1 dataset2 ... datasetn;

where you may name as few or as many data sets as you like. If you use an OUTPUT statement without specifying a data set name, SAS writes the current observation to each of the data sets named in the DATA step. Any data set name appearing in the OUTPUT statement must also appear in the DATA statement.
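As a minimal sketch of the unnamed form, the following step reads two made-up observations from in-stream data lines and uses OUTPUT with no data set name. The data set names all_a and all_b and the data values are assumptions chosen only for illustration:

```sas
DATA all_a all_b;            /* hypothetical data set names */
    input subj form $;
    output;                  /* no data set named: the observation goes to both all_a and all_b */
    DATALINES;
210006 cmed
310032 diet
;
RUN;
```

Both all_a and all_b should end up with the same two observations.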

The OUTPUT statement is pretty powerful in that, among other things, it gives us a way:

  • to write observations to multiple data sets
  • to control the output of observations to data sets based on certain conditions
  • to transpose data sets using the OUTPUT statement in conjunction with the RETAIN statement, BY-group processing, and the LAST.variable temporary variable.
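To give a feel for the third item, here is a rough sketch, not a full treatment, of collapsing several rows per subject into one row. The data set names long and wide, the variable formlist, and the data values are hypothetical; the input is assumed to be already sorted by subj:

```sas
DATA long;                            /* hypothetical input, sorted by subj */
    input subj form $;
    DATALINES;
210006 cmed
210006 diet
410010 cmed
;
RUN;

DATA wide (keep = subj formlist);
    set long;
    by subj;                          /* BY-group processing creates first.subj and last.subj */
    length formlist $ 200;
    retain formlist;                  /* carry formlist from one iteration to the next */
    if first.subj then formlist = ' ';
    formlist = catx(' ', formlist, form);   /* append the current form code */
    if last.subj then output;         /* write one observation per subject */
RUN;
```

Because the only OUTPUT statement is conditional on last.subj, SAS writes an observation only after the last row of each subject has been processed.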

Throughout the rest of this section, we'll look at examples that illustrate how to use OUTPUT statements correctly. We'll work with the following subset of the ICDB Study's log data set:

The icdblog data set

SUBJ    V_TYPE  V_DATE    FORM
210006  12      05/06/94  cmed
210006  12      05/06/94  diet
210006  12      05/06/94  med
210006  12      05/06/94  phytrt
210006  12      05/06/94  purg
210006  12      05/06/94  qul
210006  12      05/06/94  sympts
210006  12      05/06/94  urn
210006  12      05/06/94  void
310032  24      09/19/95  backf
310032  24      09/19/95  cmed
310032  24      09/19/95  diet
310032  24      09/19/95  med
310032  24      09/19/95  medhxf
310032  24      09/19/95  phs
310032  24      09/19/95  phytrt
310032  24      09/19/95  preg
310032  24      09/19/95  purg
310032  24      09/19/95  qul
310032  24      09/19/95  sympts
310032  24      09/19/95  urn
310032  24      09/19/95  void
410010  6       05/12/94  cmed
410010  6       05/12/94  diet
410010  6       05/12/94  med
410010  6       05/12/94  phytrt
410010  6       05/12/94  purg
410010  6       05/12/94  qul
410010  6       05/12/94  sympts
410010  6       05/12/94  urn
410010  6       05/12/94  void

As you can see, this log data set contains four variables:

  • subj: the subject's identification number
  • v_type: the type of clinic visit, recorded as the number of months since the subject was first seen in the clinic
  • v_date: the date of the clinic visit
  • form: codes that indicate the data forms that were completed during the subject's clinic visit

The log data set is a rather typical data set that arises from large national clinical studies in which there are a number of sites around the country where data are collected. Typically, the clinical sites collect the data on data forms and then "ship" the data forms either electronically or by mail to a centralized location called a Data Coordinating Center (DCC). As you can well imagine, keeping track of the data forms at the DCC is a monumental task. For the ICDB Study, for example, the DCC received more than 68,000 data forms over the course of the study.

In order to keep track of the data forms that arrive at the DCC, they are "logged" into a database and subsequently tracked as they are processed at the DCC. In reality, a log database will contain many more variables than we have in our subset, such as the dates the data on the forms were entered into the database, who entered the data, the dates the entered data were verified, who verified the data, and so on. To keep our lives simple, we'll just use the four variables described above.

Example 17.1

This example uses the OUTPUT statement to tell SAS to write observations to data sets based on certain conditions. Specifically, the following program uses the OUTPUT statement to create three SAS data sets — s210006, s310032, and s410010 — based on whether the subject identification numbers in the icdblog data set meet a certain condition:

OPTIONS PS=58 LS=80 NODATE NONUMBER;
LIBNAME stat481 'C:\yourdrivename\Stat481WC\05retain\sasndata';

DATA s210006 s310032 s410010;
    set stat481.icdblog;
        if (subj = 210006) then output s210006;
    else if (subj = 310032) then output s310032;
    else if (subj = 410010) then output s410010;
RUN;
PROC PRINT data = s210006 NOOBS;
    title 'The s210006 data set';
RUN;
PROC PRINT data = s310032 NOOBS;
    title 'The s310032 data set';
RUN;
PROC PRINT                NOOBS;
    title 'The s410010 data set';
RUN;   

The s210006 data set

SUBJ    V_TYPE  V_DATE    FORM
210006  12      05/06/94  cmed
210006  12      05/06/94  diet
210006  12      05/06/94  med
210006  12      05/06/94  phytrt
210006  12      05/06/94  purg
210006  12      05/06/94  qul
210006  12      05/06/94  sympts
210006  12      05/06/94  urn
210006  12      05/06/94  void


The s310032 data set

SUBJ    V_TYPE  V_DATE    FORM
310032  24      09/19/95  backf
310032  24      09/19/95  cmed
310032  24      09/19/95  diet
310032  24      09/19/95  med
310032  24      09/19/95  medhxf
310032  24      09/19/95  phs
310032  24      09/19/95  phytrt
310032  24      09/19/95  preg
310032  24      09/19/95  purg
310032  24      09/19/95  qul
310032  24      09/19/95  sympts
310032  24      09/19/95  urn
310032  24      09/19/95  void


The s410010 data set

SUBJ    V_TYPE  V_DATE    FORM
410010  6       05/12/94  cmed
410010  6       05/12/94  diet
410010  6       05/12/94  med
410010  6       05/12/94  phytrt
410010  6       05/12/94  purg
410010  6       05/12/94  qul
410010  6       05/12/94  sympts
410010  6       05/12/94  urn
410010  6       05/12/94  void

As you can see, the DATA statement contains three data set names: s210006, s310032, and s410010. That tells SAS that we want to create three data sets with the given names. The SET statement, of course, tells SAS to read observations from the permanent data set called stat481.icdblog. Then come the IF-THEN-ELSE and OUTPUT statements that make it all work. The first IF-THEN statement tells SAS to output any observations pertaining to subject 210006 to the s210006 data set; the second tells SAS to output any observations pertaining to subject 310032 to the s310032 data set; and the third tells SAS to output any observations pertaining to subject 410010 to the s410010 data set. SAS will report an error if a data set name appears in an OUTPUT statement without also appearing in the DATA statement.

The PRINT procedures, of course, tell SAS to print the three newly created data sets. Note that the last PRINT procedure does not have a DATA= option. That's because when you name more than one data set in a single DATA statement, the last name on the DATA statement is the most recently created data set, and the one that subsequent procedures use by default. Therefore, the last PRINT procedure will print the s410010 data set by default.

Now, before launching and running the SAS program, right-click to save the icdblog data set to a convenient location on your computer. Then, launch the SAS program and edit the LIBNAME statement so that it reflects the location in which you saved the data set. Then, run the program and review the output from the PRINT procedures. You should see that, as expected, the data set s210006 contains data on subject 210006; the data set s310032 contains data on subject 310032; and s410010 contains data on subject 410010.

Incidentally, note that the IF-THEN-ELSE construct used here in conjunction with the OUTPUT statement is comparable to attaching the WHERE= option to each of the data sets appearing in the DATA statement.
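For comparison, a sketch of the WHERE= approach might look like the following. It assumes the stat481 library from the LIBNAME statement above is still assigned; with no OUTPUT statement in the step, SAS offers each observation to all three data sets at the end of every iteration, and each WHERE= option keeps only the matching subject:

```sas
DATA s210006 (where = (subj = 210006))
     s310032 (where = (subj = 310032))
     s410010 (where = (subj = 410010));
    set stat481.icdblog;     /* implicit output; WHERE= filters what each data set keeps */
RUN;
```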

Example 17.2

Using an OUTPUT statement suppresses the automatic output of observations at the end of the DATA step. Therefore, if you plan to use any OUTPUT statements in a DATA step, you must use OUTPUT statements to program all of the output for that step. The following SAS program illustrates what happens if you fail to direct all of the observations to output:

DATA subj210006 subj310032;
    set stat481.icdblog;
    if (subj = 210006) then output subj210006;
RUN;
PROC PRINT data = subj210006 NOOBS;
    title 'The subj210006 data set';
RUN;
PROC PRINT data = subj310032 NOOBS;
    title 'The subj310032 data set';
RUN;

The subj210006 data set
SUBJ    V_TYPE  V_DATE    FORM
210006  12      05/06/94  cmed
210006  12      05/06/94  diet
210006  12      05/06/94  med
210006  12      05/06/94  phytrt
210006  12      05/06/94  purg
210006  12      05/06/94  qul
210006  12      05/06/94  sympts
210006  12      05/06/94  urn
210006  12      05/06/94  void

The DATA statement contains two data set names, subj210006 and subj310032, telling SAS that we intend to create two data sets. However, as you can see, the IF statement contains an OUTPUT statement that directs output to the subj210006 data set, but no OUTPUT statement directs output to the subj310032 data set. Launch and run the SAS program to convince yourself that the subj210006 data set contains data for subject 210006, while the subj310032 data set contains 0 observations. You should see a message like this in the log window:

PROC PRINT data = subj310032 NOOBS;
       title 'The subj310032 data set';
RUN;
NOTE: No observations in data set WORK.SUBJ310032.
NOTE: PROCEDURE PRINT used (Total process time):
      real time           0.00 seconds
      cpu time            0.01 seconds

You should also see that no output for the subj310032 data set appears in the output window.
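One way to avoid the empty data set, sketched here under the assumption that the stat481 library from Example 17.1 is still assigned, is to give every data set named in the DATA statement its own OUTPUT statement:

```sas
DATA subj210006 subj310032;
    set stat481.icdblog;
         if (subj = 210006) then output subj210006;
    else if (subj = 310032) then output subj310032;   /* added: route subject 310032 explicitly */
RUN;
```

With this change, observations for subject 310032 should be written to subj310032, while observations for any other subject are still written nowhere, because no OUTPUT statement directs them.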

Example 17.3

If you use an assignment statement to create a new variable in a DATA step in the presence of OUTPUT statements, you have to make sure that you place the assignment statement before the OUTPUT statements. Otherwise, SAS will have already written the observation to the SAS data set, and the newly created variable will be set to missing. The following SAS program illustrates an example of how two variables, current and days_vis, get set to missing in the output data sets because their values get calculated after SAS has already written the observation to the SAS data set:

DATA subj210006 subj310032 subj410010;
    set stat481.icdblog;
        if (subj = 210006) then output subj210006;
    else if (subj = 310032) then output subj310032;
    else if (subj = 410010) then output subj410010;
    current = today();
    days_vis = current - v_date;
    format current mmddyy8.;
RUN;
PROC PRINT data = subj310032 NOOBS;
    title 'The subj310032 data set';
RUN;

The subj310032 data set
SUBJ    V_TYPE  V_DATE    FORM    current  days_vis
310032  24      09/19/95  backf   .        .
310032  24      09/19/95  cmed    .        .
310032  24      09/19/95  diet    .        .
310032  24      09/19/95  med     .        .
310032  24      09/19/95  medhxf  .        .
310032  24      09/19/95  phs     .        .
310032  24      09/19/95  phytrt  .        .
310032  24      09/19/95  preg    .        .
310032  24      09/19/95  purg    .        .
310032  24      09/19/95  qul     .        .
310032  24      09/19/95  sympts  .        .
310032  24      09/19/95  urn     .        .
310032  24      09/19/95  void    .        .

The main thing to note in this program is that the current and days_vis assignment statements appear after the IF-THEN-ELSE and OUTPUT statements. That means that each observation will be written to one of the three output data sets before the current and days_vis values are even calculated. Because SAS sets variables created in the DATA step to missing at the beginning of each iteration of the DATA step, the values of current and days_vis will remain missing for each observation.

By the way, the today() function, whose value is assigned to the variable current, returns today's date as a SAS date value. Therefore, the variable days_vis is meant to contain the number of days since the subject's recorded visit v_date. However, as described above, the values of current and days_vis get set to missing. Launch and run the SAS program to convince yourself that the current and days_vis variables in the subj310032 data set contain only missing values. If we were to print the subj210006 and subj410010 data sets, we would see the same thing.

The following SAS program illustrates the corrected code for the previous DATA step, that is, for creating new variables with assignment statements in the presence of OUTPUT statements:

DATA subj210006 subj310032 subj410010;
    set stat481.icdblog;
    current = today();
    days_vis = current - v_date;
    format current mmddyy8.;
        if (subj = 210006) then output subj210006;
    else if (subj = 310032) then output subj310032;
    else if (subj = 410010) then output subj410010;
RUN;
PROC PRINT data = subj310032 NOOBS;
    title 'The subj310032 data set';
RUN;

The subj310032 data set
SUBJ    V_TYPE  V_DATE    FORM    current   days_vis
310032  24      09/19/95  backf   09/06/23  10214
310032  24      09/19/95  cmed    09/06/23  10214
310032  24      09/19/95  diet    09/06/23  10214
310032  24      09/19/95  med     09/06/23  10214
310032  24      09/19/95  medhxf  09/06/23  10214
310032  24      09/19/95  phs     09/06/23  10214
310032  24      09/19/95  phytrt  09/06/23  10214
310032  24      09/19/95  preg    09/06/23  10214
310032  24      09/19/95  purg    09/06/23  10214
310032  24      09/19/95  qul     09/06/23  10214
310032  24      09/19/95  sympts  09/06/23  10214
310032  24      09/19/95  urn     09/06/23  10214
310032  24      09/19/95  void    09/06/23  10214

Now, since the assignment statements precede the OUTPUT statements, the variables are correctly written to the output data sets. That is, the variable current now contains the date on which the program was run, and the variable days_vis contains the number of days between the date of the subject's visit and that date. Launch and run the SAS program to convince yourself that the current and days_vis variables are properly written to the subj310032 data set. If we were to print the subj210006 and subj410010 data sets, we would see similar results.
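The subtraction works because SAS stores dates as day counts, so the difference of two dates is a number of days. As a small check of the values in the output above, the following sketch hard-codes the two dates as date literals rather than calling today(); the variable names visit and rundate are made up for illustration:

```sas
DATA _NULL_;
    visit    = '19SEP1995'd;     /* the 09/19/95 visit date */
    rundate  = '06SEP2023'd;     /* the run date shown in the current column */
    days_vis = rundate - visit;  /* difference of two SAS dates is a count of days */
    put days_vis=;               /* writes days_vis=10214 to the log */
RUN;
```

Your own values of current and days_vis will differ, since today() returns the date on which you run the program.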

Example 17.4

After SAS processes an OUTPUT statement within a DATA step, the observation remains in the program data vector and you can continue programming with it. You can even output the observation again to the same SAS data set or to a different one! The following SAS program illustrates how you can create different data sets with some of the same observations. That is, the data sets created in your DATA statement do not have to be mutually exclusive:

DATA symptoms visitsix;
    set stat481.icdblog;
    if form = 'sympts' then output symptoms;
    if v_type = 6 then output visitsix;
RUN;
PROC PRINT data = symptoms NOOBS;
    title 'The symptoms data set';
RUN;
PROC PRINT data = visitsix NOOBS;
    title 'The visitsix data set';
RUN;

The symptoms data set

 

SUBJ    V_TYPE  V_DATE    FORM
210006  12      05/06/94  sympts
310032  24      09/19/95  sympts
410010  6       05/12/94  sympts


The visitsix data set

SUBJ    V_TYPE  V_DATE    FORM
410010  6       05/12/94  cmed
410010  6       05/12/94  diet
410010  6       05/12/94  med
410010  6       05/12/94  phytrt
410010  6       05/12/94  purg
410010  6       05/12/94  qul
410010  6       05/12/94  sympts
410010  6       05/12/94  urn
410010  6       05/12/94  void

The DATA step creates two temporary data sets, symptoms and visitsix. The symptoms data set contains only those observations containing a form code of sympts. The visitsix data set, on the other hand, contains observations for which v_type equals 6. The observations in the two data sets are therefore not necessarily mutually exclusive. In fact, launch and run the SAS program and review the output from the PRINT procedures. Note that the observation for subject 410010 in which form = 'sympts' is contained in both the symptoms and visitsix data sets.
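The same idea applies to writing an observation more than once to a single data set. In the following sketch, the data set name twice and the variable copy are hypothetical, and the stat481 library is assumed to be assigned as in Example 17.1:

```sas
DATA twice;                  /* hypothetical data set name */
    set stat481.icdblog;
    copy = 1;
    output;                  /* first copy of the observation */
    copy = 2;
    output;                  /* same observation again, with copy updated */
RUN;
```

Each input observation should appear twice in the twice data set, once with copy = 1 and once with copy = 2, because the observation stays in the program data vector after the first OUTPUT statement.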