
SQL Databases for Linux

So you find you need to store some data on your Linux system, and are wondering what program to use... There is a vast array of database systems available for Linux. Some are simple, some sophisticated, some cheap, some expensive. One of the first things you ought to do is ask yourself what you need the database to do, as that is critical to bringing the number of choices down from "stupendously bewildering" to merely "astoundingly many."

Different database systems offer substantially different ways of storing and retrieving information, and deciding what to use will require asking yourself questions about storage and retrieval.

This set of articles tries to explore this by looking at three "classes" of database systems. This article will deal with SQL Databases. A followup article will discuss xBase descendants and Keyed Tables like DBM.

One of the most common database questions that come up is "Where is the MS Access clone?"

One thing that should be made clear is that there is no such thing. "Experts" tend to regard this as a good thing, as MS Access tries to do too many things all at once to be particularly successful at any of them.

MS Access combines:

  • Data storage and retrieval,
  • A multiplexing data access system,
  • A report writer,
  • And a form generator for data entry and other interactive data manipulation.

Most of the database tools described here focus primarily on the first area. By doing just storage and retrieval, they can be considerably more reliable.

Mind you, there would be considerable merit in having a single integrated environment for data access, writing reports, and providing GUIed access to update data. There are ongoing projects to provide those sorts of capabilities, but none are yet quite as "GUI-pretty and newbie-friendly" as MS Access.

SQL Databases for Linux

There are a lot of database systems that run on Linux using SQL data access schemes.

These databases are often fairly "heavyweight", requiring considerable disk and memory resources and providing data access capabilities of considerable sophistication.

Since they provide a (somewhat) common query language, a suitably designed application can often be ported to a different database system with little effort, letting you take advantage of differing performance characteristics and avoid being forcibly dependent on any one vendor.

Unfortunately, virtually all SQL database systems offer one "extension" or another that tends to tempt developers to tune their applications specifically for one database system.

One of the benefits of the use of a reasonably "abstract" query language is that the database engine can do a lot of work for you. For instance, rather than having to write code (adding temporary variables, loop structures, and such) to access several tables, you may construct a more complex SQL query that lets the database engine join tables together for you. The hopes typically expressed are that:

  • You avoid writing the code to "physically" access the data, thus saving time, energy, and debugging effort.
  • The DBMS may do a better job of optimizing the accesses than you would, and you get to avoid the debugging effort, similarly to the way C may be more productive for programming than assembly language.
  • If performance is poor due to the DBMS not having suitable table indexes, you can add an index and watch performance improve substantially without having to write any code. Similarly, adding some memory for table buffering can improve performance without your touching a single line of code.
  • In most cases, the database server processes support transaction capabilities, improving reliability.

In practice, software bugs are sufficiently ubiquitous that you'll still need to do some debugging, and there are certainly some overheads in terms of the cost of parsing queries and submitting them to the DBMS engine.
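The join idea above can be sketched with Python's built-in sqlite3 module. The table layout and data here are invented purely for illustration; the point is that one declarative query replaces the nested loops you would otherwise write by hand.

```python
# A minimal sketch of letting the engine do the join, using Python's
# built-in sqlite3 module. Table names and data are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         customer_id INTEGER REFERENCES customers(id),
                         total_cents INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (1, 1, 1999), (2, 1, 500), (3, 2, 4200);
""")

# Instead of looping over customers and, for each one, looping over
# orders in application code, one query asks the engine to do the join:
rows = conn.execute("""
    SELECT c.name, SUM(o.total_cents)
      FROM customers c JOIN orders o ON o.customer_id = c.id
     GROUP BY c.name
     ORDER BY c.name
""").fetchall()
print(rows)   # [('Ada', 2499), ('Grace', 4200)]
```

How the engine executes this join (nested loop, hash join, index lookup) is its business, not yours, which is exactly the point of the list above.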

Other notable merits of SQL database systems include:

  • They offer the ability to readily add additional kinds of data to the system.
  • They almost always offer the ability to permit multiple users and/or processes to have robust concurrent access to data.
  • Most offer a considerable degree of transactional robustness, typically absent from less sophisticated alternatives, through the four transaction-processing requirements known as ACID:
    Atomicity

    All transactions are either performed completely (committed), or are not done at all; a partial transaction that is aborted must be rolled back.

    Consistency

    The effects of a transaction must preserve required system properties. For instance, if funds are transferred between accounts, a deposit and a withdrawal must both be committed to the database, so that the accounting system does not fall out of balance.

    In double-entry accounting, the "staying in balance" property is usually not overly difficult to maintain. The more thorny issue comes when the property is something like "Cash Balance Cannot Drop Below Zero", or "We can't ship inventory we don't have." In such cases, if you have two transactions being submitted concurrently, it could be that either could be accepted, but not both. If one of the transactions would cause balance requirements to be violated, the transaction management system needs to reject one of the transactions.

    Isolation

    Intermediate stages must not be made visible to other transactions. Thus, in the case of a transfer of funds between accounts, both sides of the double-entry bookkeeping system must change together for each transaction. This means that transactions appear to execute serially even if some of the work is done concurrently.

    Durability

    Once a transaction is committed, the change must persist, except in the face of a truly catastrophic failure.

    If Mongol hordes ride through and lay waste to your server room (or, sadly more likely, an unfortunate plane crash takes place), you can hardly expect a transaction system to guarantee that all is well, but a good transaction processing system should be resistant to moderately traumatic sorts of system failures such as a network link breaking down or perhaps even something as traumatic as a disk drive malfunctioning.
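A minimal sketch of atomicity and consistency in practice, using Python's sqlite3 module. The account names and the non-negative-balance rule are illustrative assumptions, standing in for the funds-transfer scenario above:

```python
# Toy demonstration of atomicity: both sides of a transfer commit
# together or not at all, and a CHECK constraint enforces the
# "balance cannot drop below zero" consistency rule.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE accounts (
                    name TEXT PRIMARY KEY,
                    balance INTEGER NOT NULL CHECK (balance >= 0))""")
conn.execute("INSERT INTO accounts VALUES ('checking', 100), ('savings', 50)")
conn.commit()

def transfer(amount, src, dst):
    """Move funds between accounts as one atomic unit."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except sqlite3.IntegrityError:
        return False  # constraint violated; the partial withdrawal was rolled back

transfer(30, "checking", "savings")   # succeeds: 70 / 80
transfer(500, "checking", "savings")  # rejected: would drive checking below zero
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('checking', 70), ('savings', 80)]
```

The failed transfer leaves no trace: there is no half-done withdrawal visible afterwards, which is atomicity and isolation doing their jobs.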

There are really a lot of SQL database systems available to run on Linux. At one time, there was great excitement at the thought that some people at Oracle had an internal port they were fiddling with; today, almost any database vendor that offers a version on some Unix variation sells licenses for Linux. Pretty much the only major database vendor that doesn't deploy a version on Linux is Microsoft. The top tier industry names, Oracle, Sybase, Informix, and IBM DB2, are all available on Linux.

Development licenses are typically available inexpensively or even for free, but production licenses tend to be quite expensive. These systems tend to be "heavyweights" in terms of feature sets, use of memory and disk, and licensing costs. They provide robust access to large amounts of data, at considerable price; if you are building an "enterprise" system, they are the common choices.

They are not suitable for every purpose; other database systems often offer superior characteristics in one area or another.

  • If you want an "Open Source" database system, a number are available.
  • There are a number of commercial databases specifically targeted as "embedded systems".

    These database systems tend to put the data in a single compact location, whether that be in files in a single directory hierarchy, or even in a single file.

    If the database is to be used as part of an application, it is attractive for the data to stay highly localized, in contrast with those DBMSes that, to maximize speed and robustness, manage raw disk partitions themselves.

  • In-memory databases take advantage of huge amounts of RAM and try to provide particularly fast query and analysis capabilities.
  • There are also other special purpose databases providing more advanced capabilities for text searching and the like.

Open Source SQL DBMSes

A number of database systems are available under Free Software licenses such as the GPL. Most notable are MySQL and PostgreSQL, which are available pre-packaged for many Linux distributions, and which are widely used to support Web-based applications.

There are a number of others; they include "toy" databases as well as some that used to have proprietary licenses. Firebird was once Borland InterBase, and SAPDB was once called Adabas-D. These may become of greater interest in the future, once "systems integration" efforts get further along, but they are not yet being widely integrated with Linux distributions or with Free applications.

  • MySQL
  • PostgreSQL
  • Firebird
  • InterBase
  • SAP DB
  • SQLite -- a self-contained, serverless SQL database. (Its earliest versions were built atop GNU gdbm; it has long since used its own storage engine.)
  • Gadfly -- an SQL Relational Database in Python
  • kshSQL DBMS -- an "SQL simulator" written using the Korn shell. Performance is doubtless not going to be spectacular, but if it's easy to install, it might be useful for some applications.
  • GNU SQL
  • Beagle SQL -- formerly at http://www.beaglesql.org/, but this software hasn't been sighted in the wild in a couple of years...

Embedded SQL DBMSes

"Embedded" databases are intended to be embedded in applications, and tend to be designed with a view to being easy to install and to require little, if any, attention to administration or tuning.

In-Memory SQL DBMSes

These database systems store data primarily in memory, as compared to more traditional architectures that involve "paging" data from disk as needed. At first glance, this would appear a crippling reduction in robustness, but reality lies elsewhere, as nothing prevents these systems from being tremendously "paranoid" in logging updates to more permanent storage.

The point of the exercise here is not so much to provide robustness as it is to take advantage of the fact that memory space on modern computers has grown astoundingly. These systems assume that enough physical RAM is available to hold the entire database, which means that queries proceed without worrying about what is or isn't cached.

In-Memory databases should be particularly useful for applications like data warehousing, or for providing fast responses for things like catalog queries.
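The idea can be sketched with SQLite's in-memory mode via Python's sqlite3 module; the periodic snapshot to disk stands in for the "paranoid" logging to permanent storage described above. The file name and catalog data are invented for illustration:

```python
# A minimal sketch of the in-memory idea: keep the working database
# entirely in RAM, and periodically copy it to durable storage so a
# crash loses at most the work since the last snapshot.
import os
import sqlite3
import tempfile

mem = sqlite3.connect(":memory:")          # entire database lives in RAM
mem.execute("CREATE TABLE catalog (sku TEXT PRIMARY KEY, price_cents INTEGER)")
mem.executemany("INSERT INTO catalog VALUES (?, ?)",
                [("A-100", 999), ("B-200", 1499)])
mem.commit()

# Snapshot the in-memory database to a disk file (an online copy).
path = os.path.join(tempfile.mkdtemp(), "catalog.db")
disk = sqlite3.connect(path)
mem.backup(disk)                           # RAM -> file
disk.close()

# Queries run against RAM, with no worry about what is or isn't cached.
rows = mem.execute("SELECT sku, price_cents FROM catalog ORDER BY sku").fetchall()
print(rows)   # [('A-100', 999), ('B-200', 1499)]
```

Real in-memory engines use more sophisticated write-ahead logging than a wholesale snapshot, but the division of labor is the same: serve queries from RAM, make durability someone else's (asynchronous) problem.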

Miscellaneous SQL DBMSes

There are many, many commercial database systems available on Linux; it is quite difficult to distinguish, for many of these systems, why they should be considered interesting, as many represent "Yet Another SQL DBMS conforming to some reasonable set of standards". Some have, as "claims to fame", the ability to do sophisticated text-oriented queries, or integration between a database server and a Web server.

Data Access

It's not much use having a neat new database system if you have no way of querying or updating the information in it from a language you actually want to work in.

There are a number of common ways to access data in an SQL database. Some are standardized to the point that it is not difficult to plug in a different database system as needed; others are less so.

ODBC or SQL/CLI
There are standards for this "gateway into SQL"; ODBC is commonly regarded as a Microsoft-only sort of "database driver" for Windows, but ODBC implementations do exist for Linux, providing an API designed for use from C. Since many other languages interface readily with C libraries, this is fairly widely usable for database access.
JDBC
JDBC is fairly consciously targeted as a sort of "ODBC for Java". It provides a set of Java APIs for accessing databases, and with the widespread use of Java for Web application servers, many database makers provide JDBC "drivers".
ESQL
This stands for "Embedded SQL". Database systems often used to use this as a way to help write database applications in C. You would write programs mostly in C, with SQL queries embedded as needed. The C program would then be run through an "ESQL processor" that would transform the embedded SQL queries into C code to express those queries.

This should allow programs to be written more compactly and to be more efficient than would be the case with SQL/CLI. Unfortunately, this requires work in two languages simultaneously, and tends to be somewhat nonportable.

Perl DBI
The idea behind Perl DBI is to provide a high-level set of functions for manipulating databases that work for many different databases, and then to link those functions to database-specific code (perhaps using ODBC or some database-specific library typically implemented in C).

Aficionados of Python or Tcl can find similar sorts of libraries; the somewhat more widespread popularity of Perl means that there are somewhat more options for Perl.
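For Python, the analogous layer is the DB-API 2.0 interface (PEP 249): code written against the connection/cursor surface works with many drivers. A sketch, using sqlite3 as the stand-in driver (table and data invented):

```python
# Driver-agnostic query code in the spirit of Perl DBI: only the
# connect() call names a specific database; the rest uses the
# DB-API 2.0 surface shared by many Python database drivers.
import sqlite3

def top_products(conn, limit):
    """Return the best-selling products, using only DB-API calls."""
    cur = conn.cursor()
    cur.execute("SELECT name, sold FROM products ORDER BY sold DESC LIMIT ?",
                (limit,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (name TEXT, sold INTEGER)")
conn.executemany("INSERT INTO products VALUES (?, ?)",
                 [("widget", 42), ("gadget", 17), ("doohickey", 99)])
print(top_products(conn, 2))   # [('doohickey', 99), ('widget', 42)]
```

One real-world wrinkle: DB-API drivers are allowed different parameter placeholder styles (sqlite3 uses `?`, others use `%s` or `:name`), advertised via the module's `paramstyle` attribute, so fully portable code has to account for that.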

Performance

Note that this discussion has not even tried to address the issue of which database system is "fastest."

The problem is that evaluating this is a really daunting problem. People commonly claim that one DBMS or another is much faster than the others. Unfortunately, this sort of thing is very difficult to evaluate in a scientific manner. We might find that MySQL was providing vastly better performance than PostgreSQL, until an extra table key or buffer was added, at which point the tables might turn. Or you might find that PostgreSQL has some feature to support your application that no other database system offers, so it is the only option that actually performs acceptably.

Unfortunately, when it comes to benchmarks, everyone is pretty partisan. Database vendors have often been known to send developers out to work with hardware vendors to tweak performance on industry benchmarks, and there is some indication that vendors have even tuned database engines to be specifically aware of certain benchmarks, in much the same way that compiler vendors were once accused of writing code to recognize the Byte Magazine Prime Number Benchmark, and then generate hand-tuned assembly language. Faking a benchmark like that obviously goes well past what is reasonable, but things get less clear if you find that using some vendor-specific extension dramatically improves performance on some part of a benchmark.

The way locking is handled is one easily cited place where performance will vary; a major merit of the more sophisticated database systems is that they cope well with having many users working with and modifying data concurrently. Handling that correctly requires locking some data against modification. There are several different granularities of locking, with different costs and benefits:

  • A table or whole database lock is really cheap to implement, but is rather "antisocial", as it means that only one user/process is granted access, and all others are blocked.

    There are times when this is reasonable or even necessary, particularly when data conversions or cleanup are underway, but if a system has a lot of users trying to do updates, they'll get extremely irritated at being blocked from doing their work.

  • At the other end of the spectrum is the "row lock", in which what is locked is one row in one table.

    This is definitely a lot less "antisocial" than the table lock; anybody being blocked by a row lock is likely trying to modify the very same bit of data as someone else.

    Unfortunately, the DBMS has to manage and track these locks, so behind the scenes, there will be a (probably invisible to the user) lock table with one entry for each row that is locked. If there are a lot of table rows being worked on, that's a lot of locks!

  • Sybase was long criticized for not offering row locking; they instead offered page locking, which is an intermediate between row and table locking.

    In many database systems, space is allocated to tables in "pages", where a page is a fixed block (often 2K) of storage devoted to rows in a specific table. If each row is 80 bytes, a 2K page would hold about 25 rows.

    Sybase would lock the whole page, which is less "antisocial" than locking the whole table, but if you're merely modifying one row, the lock might affect 25 times as much data as is necessary.

    Careful design of the application can keep this from being a problem. For instance, users might queue updates and submit them to a centralized "transaction update" process. If that central process is the only thing updating the tables, the page contention goes away, and there may be a performance increase because this form of locking is a bit less expensive than row locking. But it does require careful application design...
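A toy sketch of the bookkeeping trade-off behind these granularities (this is not any real DBMS's implementation; the class and its rules are invented for illustration):

```python
# Toy lock manager: row locks maximize concurrency but need one
# lock-table entry per locked row; a table lock is a single entry
# but blocks every other user of the table.
class LockManager:
    def __init__(self):
        self.table_locks = {}   # table -> holder
        self.row_locks = {}     # (table, row_id) -> holder

    def lock_table(self, table, who):
        holder = self.table_locks.get(table)
        if holder is not None and holder != who:
            return False        # someone else holds it: everyone is blocked
        self.table_locks[table] = who
        return True

    def lock_row(self, table, row_id, who):
        holder = self.table_locks.get(table)
        if holder is not None and holder != who:
            return False        # blocked by a whole-table lock
        key = (table, row_id)
        if key in self.row_locks and self.row_locks[key] != who:
            return False        # blocked only when touching the same row
        self.row_locks[key] = who
        return True

mgr = LockManager()
# With row locking, two users updating different rows both proceed...
assert mgr.lock_row("accounts", 1, "alice")
assert mgr.lock_row("accounts", 2, "bob")
# ...at the cost of one lock-table entry per locked row.
print(len(mgr.row_locks))   # 2

mgr2 = LockManager()
# A table lock is a single entry, but shuts everybody else out.
assert mgr2.lock_table("accounts", "alice")
assert not mgr2.lock_row("accounts", 7, "bob")
```

Page locking sits between the two: imagine the row-lock dictionary keyed by `(table, row_id // rows_per_page)` instead, so one entry covers a whole page's worth of rows.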

Another daunting issue is that a number of notable commercial database systems have, as a specific license clause, the condition that you are forbidden to publicly report performance benchmarks.

Recent comments

12 Aug 2007 19:40 Avatar dubnerm

sqlite-gdbm dependency
You are writing "SQLite -- an SQL database implemented atop GNU gdbm". Is it true?

As far as I know, one of SQLite's strengths is its independence from third-party libraries. Maybe you mean that the first implementation used a gdbm-compatible format? Could you provide more info?

BTW: Shouldn't articles in this category be reviewed and updated from time to time?

30 May 2006 11:36 Avatar slyzius

ScimoreDB
http://www.scimore.com

Here is another Distributed SQL RDBMS, which offers both client/server and embedded database system.

08 Jun 2004 07:53 Avatar wagnerstefan

Re: Access replacement


I don't want an access replacement to occur on linux, and support most of the comments from mly.

When I generate SQL statements, I do it with the keyboard. I can take these and move them to a Postgres, Oracle, or Informix database.

In Access, the SQL interface is a tool to punish users. It doesn't even allow indentation to make your code structured and readable.

Having a graphical user interface is nice, if you don't exceed 10 tables.

Every serious database can be accessed from the command line, which is needed for automated processes invoked from scripts of all kinds - a scheduler, for example.

And the performance is lousy.

And if you use it, you're stuck on it. The flypaper among databases - if you would call it a database.

Access is the typical MS application: fast take-off, fast crash. :)

25 Jan 2004 21:19 Avatar perlchild

Re: SAPdb embedded?

> SAPdb certainly isn't an embedded
> database. It's a relatively
> heavyweight and complete database that
> runs standalone. For me, an embedded
> database is one that can be linked in as
> a library, which isn't the case here.

Perhaps, but from the point of view where the database is optimized for a particular application and/or designed around that application, SAP DB would be an embedded database made for SAP R/3 and related solutions (despite the fact you can run SAP solutions on other databases).


06 Feb 2002 06:37 Avatar scooter

Re: Access replacement

> good piece of software.
>
> features ? (access 97 - good one)
> 1) easy to maintain tables, indexes,
> relations
> 2) very good forms, subforms, objects
> (combo, lists)
> 3) v.v.v. good reports (subreports,
> groups of data, headers, footers for
> every group - v. good)
> 4) v.v.v. easy vb & dao
> 5) v. easy to connect to sql servers
> (by odbc)
> Believe me. I have eg one system with
> near 100 tables, 200 forms, 50 reports,
> I think about Delphi/Kylix, but
> I don't know.
>
> br
> slv
>

A database should store data. Nothing more, nothing less. What/how you choose to create the applications that reference, supply, and alter the data contained in the database is a *completely* separate issue. As others have pointed out - you can write your apps in anything from C, Java, or PHP to Kylix - it matters not - as long as they support some method of communicating with the database engine (eg via SQL or ODBC). Access is not a DBMS - there is no DBMS involved with Access - the data is stored in a file, and all the clients fight over the file. It's a great tool for prototyping, and producing a small scale system that will always *stay* small - as long as you don't mind the clunky way it looks, and the huge amounts of resources your apps take up on their host PCs. However - Access apps just don't scale well - try your apps with 100 users all banging away at the forms and it will soon begin to show signs of falling apart.

Almost everything I produce these days, I do as a HTML based front end written in PHP (but it could just as easily be JSPs or even C at the back end). The DBMS I use is MySQL but again, any reasonable one would do. This means - no client software = no user fiddling = no client problems, and no rollouts to do.

And if I want a really big app, I can replace MySQL with a heavyweight DBMS with transaction support and features like bitmapped indexes, parallel loading and so forth (like, say: Oracle), with only minor changes to the app code - and no changes at all on the client PCs.

Access is a toy from an earlier age when "doing it properly" was out of reach of many small businesses. This is not the case these days, and there's no excuse for it. You don't sound like you fall into this category but it always seems to be used by people who have not grasped the first principles of data design, normalisation or basic good coding practice. They think they can operate it like Word - because it has buttons and looks the same. This is a problem that permeates the Windows world - their server OS has the same interface as their desktop offering, and so everyone thinks they are systems architects or sys admins all of a sudden.
