Relational Databases

Relational databases have been used by businesses for decades and they have steadily developed the features required for large-scale implementation, including scalability (in terms of user count and database size). The idea of relational databases was first suggested by Edgar Codd in 1969, and by the nineties, they had largely replaced hierarchical and network databases for business purposes. But the original definition is excruciatingly abstract, and no Codd-standard “true” relational database has ever been commercially released, although there are academic implementations. Textbook discussions of relational databases descend into pure math almost immediately. For practical purposes, relational databases are databases that can be queried using SQL, but purists point out that the current ISO standard definition of SQL does not mention the relational model and ISO SQL deviates from the relational model in several ways.

Roughly speaking, a relational database is a set of data connected by relations. To take a simple example, an invoicing system could have one table with a list of invoices and another with a list of customers. Each table has one record (referred to as a tuple) per line, and defines a relation – for example, the invoice number is related to the date the invoice was issued. In fact, relational tables are often referred to as “base relations”.

But there are other relations in the database. The data in the two tables mentioned above are also related to each other by customer numbers, which appear on each line of both tables. To see how much a given customer bought in sum, filter the invoice table by the customer number and sum the total column of the result. To get the address of the customer for a given invoice, filter the customer table by the customer number in the invoice and look at the address column. These are derived relations, more commonly known as views or queries.

Relational databases are now so widespread that it is hard to imagine a modern company without one. But it is not just the details of the relational model which give them such a firm position. The ubiquity of technical skills required to administer relational databases and a huge number of applications that use them guarantees that relational databases will be used for many years in the future. Another strength of relational databases is the widespread use of SQL, which makes it easy for an application to work with multiple relational databases. This was not true of hierarchical databases. Establishing SQL as a standard probably led to the development of the whole ERP industry and the rest of the OLTP applications market.

One of the great advantages of relational databases is their ability to support transactions. The physical storage of relational data on the disk is usually by row. Row-based databases are better for short rows and situations where all columns are needed in most queries. This applies to transactions, which often write all the columns of a new row at the same time. However, it is worth noting that key parts of transactions are actually add-ons to the basic relational model and are achieved in the following steps:

  • Lock affected areas of the database until the transaction is complete
  • Log all changes to the database in a separate log file
  • Write back the result immediately
  • Roll back if the transaction fails by undoing the changes recorded in the log.

Relational databases used for large scale business environments have highly refined versions of these extra features.

A variety of other features have clustered round the relation concept and allowed the databases to be used in many different contexts and to scale almost unimaginable heights. Strictly speaking, these additions have nothing to do with the relational model and, indeed, non-relational databases also have similar capabilities. So the relational model owes much of its success to the development of external concepts.

  • Highly developed indexing methods: Indexing sounds boring, but it has been a rich field for technical innovation over the past few decades. Most modern relational databases use multiple indexes simultaneously and indices often consume more storage space than the raw data in the database. Part of the database’s internal query optimization is an automatic on-the-fly selection of which index to use. Individual index types tend to work best in specific use cases. For example, bitmap indices are mostly commonly used for analysis on tables with many columns of yes/no data. They are not much use in other applications.
  • Redundancy: Another feature of modern relational databases is the ability to spread data storage across multiple servers, and provide redundancy for safety.
  • Multiple interfaces: Relational databases come with their own proprietary interfaces, but there are also standard interfaces to allow third-party products to access them generically. Providing multiple technical interfaces, all of which support SQL, is another way of opening up the database to as many third-party products as possible. The most common interfaces are the Windows-based ODBC and the Java-based JDBC. As a rule, the native interfaces provide the best performance.
  • Stored procedures: Stored procedures were originally introduced by individual vendors and are now part of the SQL standard as optional features. They are stored in the database and are fragments of procedural code that can be called from the client and executed on the server. They allow highly complex applications to run on the database server itself. Unlike SQL itself, which is reasonably well standardized, the syntax for procedures to manipulate data is completely different in every database.
  • Large-scale databases also take good advantage of multi-threaded servers, so processing loads can be spread across multiple CPUs.

Other types of database also have these features, and relational databases are not the only type around. IBM IMS, a hierarchical database, is probably still the leading OLTP database, and is reputed to be IBM’s biggest billing software product. IDMS is also still around, as well as Adabas, Model 204, and other non-relational databases. Of course, some of these now support SQL queries, even though the underlying storage model is not relational.

One of the weaknesses of relational databases is the way that related data tends to be spread around the database. Even the simplest invoicing system would require four tables to be useful: a customer table, an invoice header table, a product table and an invoice line table. The latter is needed because each invoice can have multiple lines. Printing the invoice means filtering all four tables and combining the results into a single report. And real-world applications are much more complicated than that.

Normalization, which results in splitting the data up into small tables, is the usual mode for storing relational data. Normalized databases are optimized to avoid inconsistencies in transactional systems. But data that is spread around this way is not at all easy to analyze by hand. Most applications based on relational databases provide the user with highly abstract views of the data, which make sense from a business perspective but are very difficult to see in the underlying data. The queries needed to analyze transactional data tend to be extremely complicated because related data is distributed around the database as a result of the normalization. The queries are typically difficult to define, requiring multiple joins and a complete understanding of the complexity of the schema. They are also quite slow, because the database has to join the data from the various tables back together to answer the query.

Another problem with analyzing transactional data directly is that the data model is not designed to return aggregated views such as the sum of all the transactions in a given time period. But this kind of aggregation is typical for analysis. Attempting to carry out analysis directly on data in this format results in poor performance. It should never be attempted on anything but the smallest transactional systems.

The data structures in applications built on relational databases are sometimes very complex. Large ERP systems can contain tens of thousands of tables. Vendors of business intelligence software commonly say they support various relational databases, but they rarely mention the fact that without semantic support the ability to connect to a complex relational database is almost useless. In other words, if the software that is extracting data does not know where to find the data in the right form, then the user will have to know where the data is, and build queries by hand. This issue can lead to significant cost increases in business intelligence projects.

To make analysis of transactional data feasible, many companies create data warehouses. This involves extracting data from its original source, transforming and cleansing it and finally loading it into a new database in a form that is suitable for analysis. Data warehouses are usually stored in relational databases and are often the first place in the enterprise where basic analytic concepts such as a time axis appear. The data management processes involved can be quite complex and often involve exporting the data to large text files and re-importing it into another database. The issue of data quality is particularly vexing. In many cases the data in the warehouse comes from multiple, otherwise unconnected, sources.

However, data warehouses can be very large and unwieldy. To make analysis practical, small subsets of the data are often extracted into so-called data marts for analysis purposes. Data marts are sometimes stored in relational databases, but in many cases other forms of database are used, particularly multidimensional databases. The idea here is to trade off details and complexity for speed and simplicity. Attempting to retain the complexity of the underlying ERP system in the data mart is trying to have your cake and eat it too, and is probably doomed from the start. Well-designed data marts always lack most of the complexity of the transactional world they are based on. They are optimized for accessibility, and usually hold aggregated data only. Because they feature reduced size and complexity, data marts can often deliver query results very quickly, making them popular with business users.

General-purpose relational databases designed for enterprise transactional systems are not always the best choice for storing data warehouses. The major vendors of relational databases have tried to optimize their products for this market, but face stiff competition from a relatively new technology in this area – data warehouse appliances, which offer a combination of hardware and software for high speed performance on very large quantities of data, sometimes even in the petabyte range. Data warehouse appliances use parallel architecture with optimized custom hardware and operating systems to deliver performance. Some of the systems also come with chips built into the disk controller that carry out simple filters on the data as it comes from the disk. In these systems only a fraction of the data needs to be loaded and processed by the server itself, and appliances typically make much less use of complex indices, which means they also store data much more compactly. These innovations allow appliances to process huge quantities of data on relatively cheap custom hardware.

The physical data storage model also varies from database to database. Again, this is an implementation detail and, strictly speaking, it has nothing to do with the relational model itself. The standard way to store data in transactional databases is to store by row – all the data in the first row of the table, followed by all the data in the second row, and so on.

In columnar databases, the data is stored by column, not by row, and this has an effect on how the database is used. Storing by column makes typical transactional operations such as writing or reading every column of a single row from a large database slow, so columnar databases are not well adapted to typical transactional systems. To see why, consider a command to add a new invoice to a system. It adds a new row to the invoice table, so it would require accessing and changing all the columns of the entire table, because the table is stored by column. So typical transactional operations are fairly clumsy on columnar databases. But reading one or two columns from every row in a table is fast using a columnar database. Reading a few columns and ignoring the rest is typical for OLAP operations. Because columnar databases allow you to ignore columns they are better for databases with large numbers of columns and allow separate indexes per column for optimal access speeds. They are more efficient for adding new values of a column for all rows because only one column is touched.

Most columnar databases support SQL. The distinction between row-based and columnar databases is in the storage engine. The queries may use SQL and optimize to the storage engine, for executing the actual write and read actions. In fact one columnar database, InfoBright, is actually delivered as a database engine for MySQL and accessed using SQL queries in the MySQL dialect.