You are here

Semantic Data Language

Intro

Johan ter Bekke developed a new and efficient data language (more suitable for end users than SQL) respecting data model structure and avoiding the pitfalls (joins and NULLs) of SQL. Here is an overview of the main data query and manipulation commands:

commanddescription
 getto retrieve data
 extendto derive/compute intermediate data
 valueto store one (possibly derived) value
 insertto add a record
 updateto change data
 cascadeto process data in a recursive way
 deleteto remove data

These commands have been implemented in the Xplain DBMS, see Xplain Manual usage examples (in Dutch).

For a complete overview of the semantic data language syntax see Syntax of the Xplain data language (file in pdf-format).

Relational join pitfall

Before discussing semantic data language principles it is essential to discuss a main weakness of the relational language SQL leading to a pitfall, especially for non-expert users. In particular we discuss the problems generated by the relational join operation when combined with GROUP BY. In order to be able to discuss that pitfall we use the following simple data model:

employee (emp#, name, address, town, birth_date, function, dept#);
department (dept#, address, town, main_task);

The semantic equivalent of this model:

type employee = name, address, town, birth_date, function, department.
type department = address, town, main_task.

Improved employee department modeling

We suppose that this data model is used by an organization located in Amsterdam wanting to determine which departments (dept#) have the smallest number of employees living in Amsterdam. This information can be relevant when departments offer a compensation for the travelling costs of their employees. A user could think that this problem is solved by the following SQL-query:

CREATE VIEW temporal (dept#, number) AS
SELECT department.dept#, COUNT (emp#)
  FROM employee, department 
 WHERE employee.dept# = department.dept#
   AND employee.town = Amsterdam
 GROUP BY department.dept#;
SELECT dept#
  FROM temporal
 WHERE number IN (SELECT MIN (number) FROM temporal);

If a department does not have any employee living in Amsterdam, its dept# will not be present in the temporal table created by the view temporal. A more extended query solves this problem, using a union of two sub sets:

CREATE VIEW temporal (dept#, number) AS
(SELECT department.dept#, COUNT (emp#)
   FROM employee, department 
  WHERE employee.dept# = department.dept#
    AND employee.town = Amsterdam
  GROUP BY department.dept#)
 UNION
(SELECT dept#, 0
   FROM department
  WHERE dept# NOT IN (SELECT dept# 
                        FROM employee 
                       WHERE town = Amsterdam));
SELECT dept#
  FROM temporal
 WHERE number IN (SELECT MIN (number) FROM temporal);

The last solution guarantees that if a department does not have any employee living in Amsterdam the value of the derived attribute temporal.number is 0 for such a department. Now the derived information will be correct (and complete) irrespective the contents of the database.

Semantic data language

The semantic language does not allow for join-operations; it requires applying paths really existing in the underlying data model. The first step is to derive the required number of employees per department, using the temporal attribute department its number:

extend department with number = count employee
        where town = Amsterdam
        per department.

value minimum = min department its number.

get department where number = minimum.

The term per means the same as for each, which most probably is easier to understand then GROUP BY: if an empty sub set of data is involved in a join combined with the GROUP BY construct then an incomplete, thus incorrect query result is possible! Here we have the strange situation that a seemingly semantically correct SQL-query produces a result of which the correctness depends on the actual contents of the database!

In the semantic approach NULL-values do not occur in the database. What would be the query result of counting Amsterdam-employees per department if we allow that the value of the attribute employee its department can be NULL? Moreover, allowing NULL-values would lead to many problems because in many queries we have to walk through data paths existing in the underlying data model. The above mentioned extend operation applies a path consisting of only one attribute: employee its department. However, longer paths may be used as in the following example where the path employee its department its town is involved in a query about commuters:

get employee its name, address, department, town, department its town
        where not town = department its town.

In many cases users want to retrieve information about data sets, but it is also possible and simple to get information about some specific instance, for example an employee with the identification 233:

get employee 233 its name, address, town, department.

Examples of applying insert or update or delete commands can be found in the books of Johan ter Bekke.

Recursive data processing

The last fundamental contribution of Johan ter Bekke was the design and implementation of an efficient cascading update command for recursive calculations (shortest path, critical path, reachability) in acyclic networks. This cascade command is based on graph reduction and guarantees termination: if the graph contains no cycle a correct answer will be generated. If there is any cycle, complete graph reduction is impossible and the user is informed about the nodes participating in a cycle. The papers of DBA 2004 (Innsbruck, 2004) show how the transformation of cyclic geometric graphs into acyclic time graphs extends the application area of the cascade command. In this way a generic solution is available for transitive closure problems such as finding the fastest journey between two airports irrespective the number of intermediate transfers.