4.5 - Converting Data
4.5 - Converting DataSuppose you are asked to calculate sales income using the price and the number of units sold. Pretty straightforward, eh? As long as price and units are stored in your data set as numeric variables, then you could just use the assignment statement:
sales = price * units;
It may be the case, however, that price and units are instead stored as character variables. Then, you can imagine it being a little odd trying to multiply price by units. In that case, the character variables price and units first need to be converted to numeric variables price and units. How SAS helps us do that is the subject of this section. To be specific, we'll learn how the INPUT function converts character values to numeric values.
The reality though is that SAS is a pretty smart application. If you try to do something to a character variable that should only be done to a numeric variable, SAS automatically tries first to convert the character variable to a numeric variable for you. The problem with taking this lazy person's approach is that it doesn't always work the way you'd hoped. That's why, by the end of our discussion, you'll appreciate that the moral of the story is that it is always best for you to perform the conversions yourself using the INPUT function.
Example 4.11
The following SAS program illustrates how SAS tries to perform an automatic character-to-numeric conversion of standtest and e1, e2, e3, and e4 so that arithmetic operations can be performed on them:
DATA grades;
input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
avg = round(mean(e1,e2,e3,e4),1);
std = standtest/4;
DATALINES;
Alexander Smith 78 82 86 69 1,210
John Simon 88 72 86 . 990
Patricia Jones 98 92 92 99 1,010
Jack Benedict 54 63 71 49 875
Rene Porter 100 62 88 74 1,180
;
RUN;
PROC PRINT data = grades;
RUN;
Okay, first note that for some crazy reason, all of the data in the data set have been read in as character data. That is, even the exam scores (e1, e2, e3, e4) and the standardized test scores (standtest) are stored as character variables. Then, when SAS goes to calculate the average exam score (avg), SAS first attempts to convert e1, e2, e3, and e4 to numeric variables. Likewise, when SAS calculates a new standardized test score (std), SAS first attempts to convert standtest to a numeric variable. Let's see how it does. Launch and run the SAS program, and before looking at the output window, take a look at the log window. You should see something that looks like this:
OPTIONS NONOTES NOSTIMER NOSOURCE NOSYNTAXCHECK; DATA grades; input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $; avg = round(mean(e1,e2,e3,e4),1); std = standtest/4; DATALINES; NOTE: Character values have been converted to numeric values at the places given by: (Line):(Column). 71:19 71:22 71:25 71:28 72:8 NOTE: Invalid numeric data, standtest='1,210' , at line 72 column 8. RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0 Alexander Smith 78 82 86 69 1,210 name=Alexander Smith e1=78 e2=82 e3=86 e4=69 standtest=1,210 avg=79 std=. _ERROR_=1 _N_=1 NOTE: Invalid numeric data, standtest='1,010' , at line 72 column 8. Patricia Jones 98 92 92 99 1,010 name=Patricia Jones e1=98 e2=92 e3=92 e4=99 standtest=1,010 avg=95 std=. _ERROR_=1 _N_=3 NOTE: Invalid numeric data, standtest='1,180' , at line 72 column 8. Rene Porter 100 62 88 74 1,180 name=Rene Porter e1=100 e2=62 e3=88 e4=74 standtest=1,180 avg=81 std=. _ERROR_=1 _N_=5 NOTE: Missing values were generated as a result of performing an operation on missing values. Each place is given by: (Number of times) at (Line):(Column). 3 at 72:17 NOTE: The data set WORK.GRADES has 5 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 0.00 seconds user cpu time 0.00 seconds system cpu time 0.00 seconds memory 708.81k OS Memory 20388.00k Timestamp 05/15/2023 12:13:39 PM Step Count 24 Switch Count 2 Page Faults 0 Page Reclaims 199 Page Swaps 0 Voluntary Context Switches 10 Involuntary Context Switches 0 Block Input Operations 0 Block Output Operations 264
The first NOTE that you see is a standard message that SAS prints in the log to warn you that it performed an automatic character-to-numeric conversion on your behalf. Then, you see three NOTES about invalid numeric data concerning the standtest values 1,210, 1,010, and 1,180. In case you haven't figured it out yourself, it's the commas in those numbers that is throwing SAS for a loop. In general, the automatic conversion produces a numeric missing value from any character value that does not conform to standard numeric values (containing only digits 0, 1, ..., 9, a decimal point, and plus or minus signs). That's why that fifth NOTE is there about missing values being generated. The output itself:
Obs | name | e1 | e2 | e3 | e4 | standtest | avg | std |
---|---|---|---|---|---|---|---|---|
1 | Alexander Smith | 78 | 82 | 86 | 69 | 1,210 | 79 | . |
2 | John Simon | 88 | 72 | 86 | 990 | 82 | 247.50 | |
3 | Patricia Jones | 98 | 92 | 92 | 99 | 1,010 | 95 | . |
4 | Jack Benedict | 54 | 63 | 71 | 49 | 875 | 59 | 218.75 |
5 | Rene Porter | 100 | 62 | 88 | 74 | 1,180 | 81 | . |
shows the end result of the attempted automatic conversion. The calculation of avg went off without a hitch because e1, e2, e3, and e4 contain standard numeric values, whereas the calculation of std did not because standtest contains nonstandard numeric values. Let's take this character-to-numeric conversion into our own hands.
Example 4.12
The following SAS program illustrates the use of the INPUT function to convert the character variable standtest to a numeric variable explicitly so that an arithmetic operation can be performed on it:
DATA grades;
input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
std = input(standtest,comma5.)/4;
DATALINES;
Alexander Smith 78 82 86 69 1,210
John Simon 88 72 86 . 990
Patricia Jones 98 92 92 99 1,010
Jack Benedict 54 63 71 49 875
Rene Porter 100 62 88 74 1,180
;
RUN;
PROC PRINT data = grades;
var name standtest std;
RUN;
The only difference between the calculation of std here and that in the previous example is that the standtest variable has been inserted here into the INPUT function. The general form of the INPUT function is:
INPUT(source, informat)
where:
- source is the character variable, constant or expression to be converted to a numeric variable
- informat is a numeric informat that allows you to read the values stored in source
In our case, standtest is the character variable we are trying to convert to a numeric variable. The values in standtest conform to the comma5. informat, and hence its specification in the INPUT function.
Let's see how we did. Launch and run the SAS program, and again before looking at the output window, take a look at the log window. You should see something that now looks like this:
DATA grades;
input name $ 1-15 e1 $ e2 $ e3 $ e4 $ standtest $;
std = input(standtest,comma5.)/4;
DATALINES;
NOTE: The data set WORK.GRADES has 5 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
;
RUN;
PROC PRINT data = grades;
NOTE: Writing HTML Body file: sashtml.htm
var name standtest std;
RUN;
NOTE: There were 5 observations read from the data set WORK.GRADES.
NOTE: PROCEDURE PRINT used (Total process time):
real time 0.51 seconds
cpu time 0.34 seconds
Ahhaa! No warnings about SAS taking over our program and performing automatic conversions. That's because we are in control this time! Now, looking at the output:
Obs | name | standtest | std |
---|---|---|---|
1 | Alexander Smith | 1,210 | 302.50 |
2 | John Simon | 990 | 247.50 |
3 | Patricia Jones | 1,010 | 252.50 |
4 | Jack Benedict | 875 | 218.75 |
5 | Rene Porter | 1,180 | 295.00 |
we see that we successfully calculated std this time around. That's much better!
A couple of closing comments. First, I might use our discussion here to add another item to your growing list of good programming practices. Whenever possible, make sure that you are the one who is in control of your program. That is, know what your program is doing at all times, and if it's not doing what you'd expect it to do for all situations, then rewrite it in such a way to make sure that it does.
Second, you might be wondering "geez, we just spent all this time talking about character-to-numeric conversions, but what happens if I have to do a numeric-to-character conversion instead?" Don't worry ... SAS doesn't let you down. If you try to do something to a character variable that should only be done to a numeric variable, SAS automatically tries first to convert the character variable to a numeric variable. If that doesn't work, then you'll want to use the PUT function to convert your numeric values to character values explicitly. We'll address the PUT function in Stat 481 when we learn about character functions in depth.