codEX Project
Normalizing as the first step to improve
Data Flow

Computers are responsible for data processing. At the beginning the input is defined. After some computation the output will be generated. An interactive application expects some string as input for example. This could be an integer value which is multiplied several times. During the output processing this new data is written into a file or printed on the screen. The data flow analysis respects this processing of data which is important to see which data is used and how.

The example used before to introduce the basics of MetaCode™ was using a very simple data processing structure. As part of the if control structure the existence and thruth of the $_GET['test'] was verified (003:004:026). If the statement is true, in this case if the GET variable is filled with a value unequal to FALSE or 0, the according code block from 005:013:066 to 006:015:095 is executed.

003:003:012 childopen
003:004:026 varget test
003:005:027 equal
003:006:031 childclose


Under some circumstances, which will be discussed in the chapter about the control structures, this input data is re-used within the semi-dynamic output (006:014:113):

006:012:080 output
006:013:081 string "Variable set to "
006:014:113 varget test


This data processing is very simple because no transpositional or substitutional influence for the data is realized. The content of the input data $_GET['test'] is always the same until the application run-time has finished. However, not every software is using such a simple data processing. Let's enhance the example with some data mutation at the beginning. The PHP function substr() is used to save the first byte of the input of $_GET['test'] in the variable $newtest only (003:002:017). The if conditional testing is also refering to the new variable (005:008:054). Even the final output procedure is still using the plain input (008:020:146):

001:001:001 .BEGIN_PHP:
003:002:017 var newtest
003:003:019 clone
003:004:027 childopen
003:005:040 varget test
003:006:047 data , 0, 1
003:007:047 childclose
005:008:054 if
005:009:055 childopen
005:010:063 var newtest
005:011:065 data
005:012:065 childclose
005:013:066 then
006:014:076 output
006:015:095 string "Variable set to 1"
007:016:099 else
007:017:104 then
008:018:114 output
008:019:132 string "Variable set to "
008:020:146 var newtest
009:021:150 fi
011:022:156 .END_PHP;


The data flow analysis has to determine the real dataflow if the input. We can say that something from the real input $_GET['test'] is used within the variable $newtest. Until we determine how the real data is copied, we can not say if the new data structure is possibly affected in any way. Further analysis of substr() as used function will show that only the first byte of $_GET['test'] is used. If the input was 123 then $newtest will hold the value 1 only. In this case it does not matter what the content besides the first byte was.

Assertions and Infection Paths

During the analysis every mutation of data needs to be veryified if there is a substitution or transposition for strings or numbers. For example if there is strrev() used to reverse a string the whole situation changes. The input 123 will be reversed to 321 and if afterwards substr() with the same conditions is used, the content of $newtest will be 3 instead of 1. This will change the whole control flow of the application as we will prove later.

To define the data mutation codEX need to know all common functions, their behaviour, influence and dependencies. An internal database hold all of them and the according attributes. This is a draft definition of substr() and strrev() mentioned before:

substr:
¦ Mutation: "yes if not param2 = len(param1) and not param1 = 0"
¦ Influence: all
¦ Result: Substring

strrev:
¦ Mutation: "yes if not strrev(input) equals input"
¦ Influence: all
¦ Result: Reversing


Furthermore the influence of arithmetic operators according to number values needs to be respected:

+:
¦ Mutation: yes
¦ Influence: numbers
¦ Result: Addition

-:
¦ Mutation: yes
¦ Influence: numbers
¦ Result: Subtraction

(...)


Such an intelligent data flow analysis is very hard to realize. This is because all common functions need to be addressed properly. As you can see in the example of substr(), there is a mutation not in any case. Whenever the full input string is defined as substring, the output will be the same as the input. Therefore, this will be the same as like the operator = is used as the following code example shows:

$result1 = substr($input, 0, strlen($input));
$result2 = $input;

if($result1 == $result2){
   echo "Equality given!";
}


The formal representation of the data flow analysis will explain the infection path. The absolute infection indicates how much of the data held is infected. The data is infected if it relies from an infected data source. Input data by external components, especially users, are infected always. This is why $_GET['test'] is infected as 100 %. And even $newtest which is holding one byte only, is infected 100 % because it re-uses an infected source data. If an internal variable is used, for example $i for a counter, no infection would be given:

1) 100%: $_GET['test']
2) 100%: $newtest
3)   0%: $i


The computation of the relative infection is much harder. The relativity is used to indicate the infection level of a variable according to the real dependency of the original data. The original data in this example is the user input $_GET['test'] which is infected 100 %. The generation of the variable $newtest uses the function substr() which used just some part of the original string. If the original string was 123 and the extraction of substr() uses the first byte only, then the relative infection of $newtest according to $GET['test'] is 33,3 % The internal variable $i is still independent and therefore not infected at all:

1) 100%: $_GET['test']
2)  33%: $newtest
3)   0%: $i


As you can see the generation of the relative infection severity requires a deep understanding of the dataflow and the involved functions. Thus, it is not easy to get the relative infection vector for such a highly dynamic example as like $newtest = strrev(substr($input, -1, strlen($input) / 2)).

Control Structures

Control structures are very important for the control flow of the application. They decide what kind of code blocks are addressed in which manner. The classic example is the if decision which we will discuss within the code example introduced at the beginning. If the condition is met (003:002:011), the defined codeblock is executed. Otherwise the further optional elseif decisions are requested. If none of them met, the optional else structure is used (005:010:065):

001:001:001 .BEGIN_PHP:
003:002:011 if
003:003:012 childopen
003:004:026 varget test
003:005:027 equal
003:006:031 childclose
003:007:032 then
004:008:042 output
004:009:043 string "Variable set to 1"
005:010:065 else
005:011:070 then
006:012:080 output
006:013:081 string "Variable set to "
006:014:113 varget test
007:015:116 fi
009:016:122 .END_PHP;


Regarding this MetaCode™ example the following addresses are declared as part of a control structure:

003:002:011 if
003:003:012 childopen
003:004:026 varget test
003:005:027 equal
003:006:031 childclose
003:007:032 then
005:010:065 else
005:011:070 then
007:015:116 fi


These elements are identified as control structure tokens. Regarding to them the depth of a control structure is notes. For example in 003:002:011 the code enters the first control structure as if which is left in 005:010:065 for else and re-entered in 005:011:070. The final exit of the whole structure is the last fi which is found in 007:015:116. In the same way other control structures as like while, until, for or case are handled. The slicing representation of this structure is documented as follows:

001:001:001 *(0) .BEGIN_PHP:
003:002:011 +(1) if
003:003:012 *(1) childopen
003:004:026 *(1) varget test
003:005:027 *(1) equal
003:006:031 *(1) childclose
003:007:032 *(1) then
004:008:042 *(1) output
004:009:043 *(1) string "Variable set to 1"
005:010:065 -(0) else
005:011:070 +(1) then
006:012:080 *(1) output
006:013:081 *(1) string "Variable set to "
006:014:113 *(1) varget test
007:015:116 -(0) fi
009:016:122 *(0) .END_PHP;


A graphical representation of this behaviour is shown in the following illustration. As you can see the primary condition A is affecting the use of the following code blocks alpha and beta. The condition level of this simple code swaps between 0 and 1.



This can be reduced to the following entry points of control structures, where only the changing elements are accepted as influence of control behaviour:

003:002:011 +(1) if
005:010:065 -(0) else
005:011:070 +(1) then
007:015:116 -(0) fi


This makes it possible to do model checking on the basis of a control flow diagram. The Kripke structures and the computation trees (Computational Tree Logic) are able to determine all the states (start points, flows and end points) [Kripke 1963]. Furthermore the classification of finite and infinite traces is possible during process of further abstraction (trace semantics to relational semantics to natural semantics is able to determine the escape semantics of the application). Due to no loops and recusrive functions are used, the application has a finite set of states and will terminate for sure. Thus, the prepositions for reaching the first block Y within the if decision is handled as A between 003:002:011 and 003:007:032:

003:003:012 childopen
003:004:026 varget test
003:005:027 equal
003:006:031 childclose


Therefore only of condition A is met, code block Y is executed. The extraction of the decision structure looks like this:

003:002:011:
¦ Type: if
¦ Condition: varget test = 1
¦ Change: +
¦ Level: 1


If this condition is not met, the else condition comes up and brings the execution path to 005:011:070. This leads to the assumption that the output of "varget test" is only done if the if decision is false. Therefore if "varget test" is not equal 1 the output can be generated.

Therefore, the following results within real-world testing could be determined. For the representation the classic Hoare triple notation of the form {P}C{Q} is used [Hoare 1969]. The P stands for the precondition (input), C is the application or function and Q is the postcondition (output):

1) {"1"}C{S1}
2) {"2"}C{S2."2"}
3) {"a"}C{S2."a"}
4) {"<"}C{S2."<"}
:  :   :
:  :   :
n) {*}C{S2.*}


This is a very good example for an application which is vulnerable to HTML injection and cross site scripting. Because the input is used within the output further manipulation of the latter is possible.

Cross Site Scripting: {"<script>alert('a')</script>"}C{S2."<script>alert('a')</script>"}


Bilbiography

Hoare, C.A.R (1969), "An Axiomatic Basis for Computer Programming", Commun. ACM 12(10): 576-580 (1969)

Kripke, S.A. (1963), A Semantical Analysis of Modal Logic I: Normal Modal Propositional Calculi, Zeitschrift für Mathematische Logik und Grundlagen der Mathematik, 67-96