3.3 - Formatted Input

The fundamental difference between column input, which we studied in the previous lesson, and formatted input, which we'll explore now, is that column input is only appropriate for reading standard numeric data, while formatted input allows us to read both standard and nonstandard numeric data. That is, formatted input combines the features of column input with the ability to read nonstandard data values.

Standard numeric data values

Recall that standard numeric data values can contain only:

numbers
decimal points
numbers using scientific (E) notation
negative (minus) and positive (plus) signs

Examples of standard numeric values include 26, 3.9, -13, +3.14, 314E-2, and 2.193E3.

Nonstandard numeric data values

On the other hand, nonstandard numeric data values include:

values that contain special characters, such as dollar signs ($), percent signs (%), and commas (,)
date and time values
data in fraction, integer binary, real binary, and hexadecimal forms

Examples of nonstandard numeric values include: 23.3%, $1.26, and 03/07/47.

The INPUT Statement

Here's the general form of the INPUT statement when using formatted input:

INPUT <pointer-control> variable informat.;

where:

pointer-control tells SAS at what column to start reading the data value
variable is the name of the variable being created
informat is a special instruction that tells SAS how to read the raw data values

A couple of things here. The above INPUT statement is written using standard SAS Help notation. The pointer control appears in angle brackets (<>) only to indicate that it is optional, that is, not necessary for every variable you create. For example, you need not tell SAS to go to column 1 if that's where you want to start reading data values, because that's where SAS starts by default. In practice, the angle brackets (<>) should not appear in the INPUT statements in your program ... otherwise, SAS will hiccup.

There are two pointer controls that we'll learn about here:

The @n pointer control moves the input pointer to a specific column number n
The +n pointer control moves the input pointer forward n columns to a column number that is relative to the current position

Again, an informat is what is used to tell SAS what special instructions are required to read in the raw data values. Many (too many?) special informats are available in SAS. For example, the numeric informat "mmddyy8." tells SAS that you want to read in a date that takes up to 8 spaces and looks like 10/27/05. The numeric informat "comma6." tells SAS that you want to read in a number that contains a comma and takes up to 6 spaces (including the comma), such as the number 11,387.
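
As a minimal sketch of those two informats in action, the following program reads one made-up record containing the two raw values mentioned above. The data set name demo and the variable names visit and amount are assumptions chosen only for illustration:

```sas
DATA demo;
  /* columns 1-8 hold a date such as 10/27/05;      */
  /* columns 10-15 hold a number containing a comma */
  input @1  visit  mmddyy8.
        @10 amount comma6.;
  DATALINES;
10/27/05 11,387
;
RUN;
PROC PRINT data = demo;
  title 'Output dataset: DEMO';
RUN;
```

If the sketch behaves as expected, amount is stored as the plain number 11387, and visit is stored as a numeric SAS date value rather than as the text 10/27/05.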

Let's take a look at an example!

Example 3.8

The following SAS program uses the @n column pointer control and standard numeric informats to read three numeric variables — subj, height, and weight — into a temporary SAS data set called temp:

DATA temp;
  input @1  subj 4. 
        @27 height 2. 
        @30 weight 3.;
  DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
  ;
RUN;
PROC PRINT data = temp;
  title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
Obs	subj	height	weight
1	1024	65	125
2	1167	68	140
3	1168	.	190
4	1201	68	190
5	1302	63	115

If you look at the INPUT statement, you'll see that it uses @n absolute pointer controls to move the input pointer first to column 1 to read subj, then to column 27 to read height, and finally to column 30 to read weight. In general, the @n moves the pointer to column n, the first column of the field that is being read.

The 4. that appears after subj, the 2. that appears after height, and the 3. that appears after weight are the informats that tell SAS how to read each of the raw data values. In each case here, we are trying to read in standard numeric data values, and so we use what is called the w.d informat. The w tells SAS the width of the raw data value, that is, how many columns it occupies. The d, which is optional, tells SAS the number of implied decimal places for the raw data value. The w and d must be connected by a period (.) delimiter. If the d is not present, you still need to make sure that you include the period in the informat name.

Making this all a little more concrete ... here, the subj values are four columns wide with no decimal places, and hence we use the 4. informat. We alternatively could have specified a 4.0 informat, but we could not have specified 4 (without the period) as the informat. The height values are two columns wide with no decimal places, and hence we use the 2. informat. Finally, the weight values are three columns wide with no decimal places, and hence we use the 3. informat.

Incidentally, the w.d informat ignores any d that we specify if the data value already contains a decimal point. So, for example, if we had a raw data value of 23.001 (occupying 6 columns with 5 digits, 1 decimal point, and 3 decimal places) and specified a 6. informat, SAS would still store the value as 23.001, even though we told SAS not to expect any decimal places.
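
To see the implied decimal places at work, here is a small sketch using made-up data. The data set name decimals and the variable name x are assumptions for illustration only:

```sas
DATA decimals;
  /* 4.2 = the field is 4 columns wide, with 2 implied decimal places */
  input @1 x 4.2;
  DATALINES;
2345
23.5
;
RUN;
PROC PRINT data = decimals;
  title 'Output dataset: DECIMALS';
RUN;
```

The first raw value contains no decimal point, so SAS should apply the two implied decimal places and store 23.45. The second raw value already contains a decimal point, so the d is ignored and SAS should store 23.5.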

Okay ... launch and run the SAS program, and review the output obtained from the print procedure to convince yourself that the three variables were read correctly. Oh, do you remember when I said that you needn't tell SAS to move the input pointer to the first column because it does so by default? You might want to convince yourself of this by removing the @1 that appears in the INPUT statement. Then, rerun the program to convince yourself that SAS still reads the data values properly.
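
For reference, a sketch of the modified program might look like the following. It is the same as Example 3.8 except that the @1 has been dropped:

```sas
DATA temp;
  /* no @1 here: the input pointer starts at column 1 by default */
  input     subj 4.
        @27 height 2.
        @30 weight 3.;
  DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
  ;
RUN;
PROC PRINT data = temp;
  title 'Output dataset: TEMP';
RUN;
```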

Example 3.9

The following SAS program uses the @n column pointer control and standard character and numeric informats to read, respectively, two character variables, l_name and f_name, and two numeric variables, weight and height, into a temporary SAS data set called temp:

DATA temp;
  input @18 l_name $6.
        @6  f_name $11.
        @30 weight 3.
        @27 height 2.;
  DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
  ;
RUN;
PROC PRINT data = temp;
  title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
Obs	l_name	f_name	weight	height
1	Smith	Alice	125	65
2	White	Maryann	140	68
3	Jones	Thomas	190	.
4	Arnold	Benedictine	190	68
5	Ho	Felicia	115	63

The INPUT statement uses @n absolute pointer controls to move the input pointer first to column 18 to read l_name, then back to column 6 to read f_name, forward to column 30 to read weight, and back again to column 27 to read height. This example illustrates, therefore, how you can use @n pointer controls to read data fields in any order ... backward, forward, and backward again.

The 3. that appears after weight and the 2. that appears after height should look familiar. They are again the numeric informats that tell SAS how to read each of the two numeric variables of interest. Now, if you look at the informats for the two character variables, l_name and f_name, you should see that they each begin with a dollar sign ($). Because we are trying to read in character data values, we use what is called the $w. informat. The w tells SAS the width of the raw data value, that is, how many columns it occupies. The dollar sign ($) and period (.) are required delimiters. In our example, the last name (l_name) occupies as many as 6 columns, hence the $6. informat, and the first name (f_name) occupies as many as 11 columns, hence the $11. informat.

Okay ... launch and run the SAS program, and review the output obtained from the print procedure to convince yourself that the four variables were read in correctly.

Example 3.10

In addition to using @n absolute pointer controls with numeric and character informats, the following SAS program uses +n relative pointer controls with nonstandard informats to create a temporary SAS data set called temp containing six variables:

DATA temp;
  input @1  subj 4.
        @6  f_name $11.
        @18 l_name $6.
        +3 height 2.
        +5 wt_date mmddyy8.
        +1 calorie comma5.;
  DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
  ;
RUN;
PROC PRINT data = temp;
  title 'Output dataset: TEMP';
RUN;

Output dataset: TEMP
Obs	subj	f_name	l_name	height	wt_date	calorie
1	1024	Alice	Smith	65	13118	2036
2	1167	Maryann	White	68	13118	1800
3	1168	Thomas	Jones	.	13119	2302
4	1201	Benedictine	Arnold	68	13117	2432
5	1302	Felicia	Ho	63	13149	1972

You should now understand the specifications for reading subj, f_name and l_name. The INPUT statement tells SAS to go to column 1 to read the numeric field subj that is four columns wide, then to go to column 6 to read the character field f_name that is 11 columns wide, and then to go to column 18 to read the character field l_name that is 6 columns wide.

The +n relative pointer controls are used to read the remaining three variables: height, wt_date, and calorie. The +n pointer control moves the input pointer forward to a column number that is relative to the current position of the pointer. That is, the +n moves the pointer forward n columns. In order to count correctly, it is important to understand where the column pointer is located after each data value is read. In general, when using formatted input, the column pointer moves to the first column that follows the field that was just read.

Again, let's make it more concrete. When SAS finishes reading the l_name field, the column pointer moves to the first column that follows that field, that is, to the column that immediately follows the d in Arnold (column 24). Now, the height field begins 3 columns to the right, so that's why we tell SAS to move +3 before reading the height data. When SAS finishes reading the height field, the column pointer moves to the first column that follows that field, that is, to the empty column that follows the heights (column 29). Now, the wt_date field begins 5 columns to the right, so that's why we tell SAS to move +5 before reading the weight dates. When SAS finishes reading the wt_date field, the column pointer moves to the first column that follows the field, that is, to the empty column just before the calorie data (column 42). Now, the calorie field begins 1 column to the right, so that's why we tell SAS to move +1 before reading the calorie data.
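
One way to check that column counting is to rewrite the relative moves as absolute ones. The sketch below should read the same fields as Example 3.10, with each +n replaced by the @n column that the arithmetic above works out to. The data set name temp2 is an assumption used only to keep this version separate from temp:

```sas
DATA temp2;
  input @1  subj 4.
        @6  f_name $11.
        @18 l_name $6.
        @27 height 2.          /* pointer at 24, +3 gives column 27 */
        @34 wt_date mmddyy8.   /* pointer at 29, +5 gives column 34 */
        @43 calorie comma5.;   /* pointer at 42, +1 gives column 43 */
  DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
  ;
RUN;
PROC PRINT data = temp2;
  title 'Output dataset: TEMP2';
RUN;
```

If the counting is right, the printed values should match those from the +n version of the program.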

We'd be all done explaining the INPUT statement if it weren't for the wt_date and calorie fields containing nonstandard numeric data. The wt_date field contains forward slashes (/) for specifying the dates, and the calorie data contains commas (,). To read the wt_date field, we use the mmddyy8. informat to tell SAS to expect dates written in mm/dd/yy form. The 8, of course, tells SAS that the dates could occupy as many as eight columns. To read the calorie field, we use the comma5. informat to tell SAS to read numeric data containing commas and occupying as many as 5 columns (including the comma).

In general, the COMMAw.d informat is used to read numeric values and to remove embedded blanks, commas, dashes, dollar signs, percent signs, and right parentheses. A left parenthesis at the start of a value is converted to a minus sign. The w tells SAS the width of the field (including dollar signs, decimal points, and other special characters). The d, which is optional, tells SAS the number of implied decimal places for a value.
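
Here is a small sketch of the COMMAw.d informat applied to a few made-up values. The data set name money and the variable name amount are assumptions for illustration only:

```sas
DATA money;
  /* comma8. = up to 8 columns, counting the $, the commas, and the parentheses */
  input @1 amount comma8.;
  DATALINES;
$1,250
(500)
2,036
;
RUN;
PROC PRINT data = money;
  title 'Output dataset: MONEY';
RUN;
```

Based on the description above, the three values should be stored as 1250, -500, and 2036: the dollar sign and commas are stripped, and the parentheses around 500 turn it into a negative number.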

Whew!!! Now, go ahead and launch and run the SAS program, and review the output obtained from the print procedure. In doing so, note that formatted input does not imply formatted output. For example, note that SAS stores 12/01/95 as the number 13118. We won't worry about this too much yet. It's just that, interestingly enough, SAS stores dates as numeric values equal to the number of days that have passed since January 1, 1960. Yup, that's right: dates after January 1, 1960, are stored as unique positive integers, and dates before January 1, 1960, are stored as unique negative integers. Also, note that SAS prints the values of the variable calorie without commas. In order to get SAS to print the values in a more understandable format, we must tell SAS how to format the output by using a SAS FORMAT statement.
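
As a quick aside, the day count can be checked with SAS date literals. The sketch below is not part of the lesson's program; the variable names d and origin are assumptions for illustration, and the PUT statements write to the SAS log rather than to the output window:

```sas
DATA _NULL_;
  d = '01DEC1995'd;        /* date literal for December 1, 1995 */
  origin = '01JAN1960'd;   /* the day from which SAS counts     */
  put d=;                  /* unformatted: the stored day count */
  put d= mmddyy8.;         /* the same value, written as a date */
  put origin=;
RUN;
```

The unformatted value of d should be 13118, matching the value printed for 12/01/95 above, and origin should be 0.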

Type in the following statement exactly as it appears:

FORMAT wt_date mmddyy8. calorie comma5.;

between the INPUT statement and the DATALINES statement in your program. Rerun the SAS program with the now included FORMAT statement, and review the output from the print procedure to convince yourself that the data are now printed as desired. You should now see that it looks something like this:

Output dataset: TEMP
Obs	subj	f_name	l_name	height	wt_date	calorie
1	1024	Alice	Smith	65	12/01/95	2,036
2	1167	Maryann	White	68	12/01/95	1,800
3	1168	Thomas	Jones	.	12/02/95	2,302
4	1201	Benedictine	Arnold	68	11/30/95	2,432
5	1302	Felicia	Ho	63	01/01/96	1,972
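
For reference, here is a sketch of the complete Example 3.10 program with the FORMAT statement in place. Apart from the comment, nothing here is new; it simply shows where the statement sits:

```sas
DATA temp;
  input @1  subj 4.
        @6  f_name $11.
        @18 l_name $6.
        +3 height 2.
        +5 wt_date mmddyy8.
        +1 calorie comma5.;
  /* formats control how values are displayed, not how they are stored */
  FORMAT wt_date mmddyy8. calorie comma5.;
  DATALINES;
1024 Alice       Smith  1 65 125 12/1/95  2,036
1167 Maryann     White  1 68 140 12/01/95 1,800
1168 Thomas      Jones  2    190 12/2/95  2,302
1201 Benedictine Arnold 2 68 190 11/30/95 2,432
1302 Felicia     Ho     1 63 115 1/1/96   1,972
  ;
RUN;
PROC PRINT data = temp;
  title 'Output dataset: TEMP';
RUN;
```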
