Task: Design Database

Purpose

To ensure that persistent data is stored consistently and efficiently, and to define the behavior that must be implemented in the database.

Relationships

Roles

Primary Performer:
Additional Performers:

Inputs

Mandatory:
Optional:
• None

Outputs

      Main Description

Designing the database identifies the persistent classes in the design, defines an appropriate database structure to store them, and establishes the mechanisms and strategies for storing and retrieving persistent data in such a way that the performance criteria for the system are met. The database designer is responsible for producing a database design that is both efficient and understandable.

      Steps

      Map Persistent Design Classes

Persistent classes in the design model represent the information the system must store. Conceptually, these classes may resemble a relational design (for example, the classes in the design model may be reflected in some fashion as entities in the relational schema). As the design moves from elaboration into construction, however, the goals of the design model and the relational data model diverge. For example, the design model might define a Customer class with the attributes name: String, address: String, and customerID: String. When modeled in the RDBMS, this would translate to a table called Customer with the columns Name, Address, and CustomerID.
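A minimal sketch of this mapping in SQL (column types and lengths are illustrative assumptions, not part of the design model):

    CREATE TABLE Customer (
        CustomerID  VARCHAR(20)  PRIMARY KEY,   -- maps customerID: String
        Name        VARCHAR(100) NOT NULL,      -- maps name: String
        Address     VARCHAR(200)                -- maps address: String
    );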

      Optimize Data Model for Performance

      Objects which are retrieved together can be stored together for improved performance.

      In the case of a relational data model, the initial mapping generally yields a simple class-to-table mapping. If objects from different classes need to be retrieved at the same time, the RDBMS uses an operation called a table join to retrieve the rows related to the objects of interest. For frequently accessed data, join operations can be very computationally expensive. To eliminate the cost of the join, a standard relational technique is de-normalization.

De-normalization combines columns from two or more tables into a single table, effectively pre-joining the information. It trades more expensive update operations for less expensive retrieval operations. De-normalization also reduces performance for queries interested only in the attributes of one of the objects joined in the de-normalized table, since all attributes are retrieved on every query. But for cases where the application normally wants all of the attributes, there can be a significant performance improvement.

      De-normalizing more than two tables is rare and increases the cost of inserts and updates as well as the cost of non-join queries. Limiting de-normalization to two tables is a good policy unless strong and convincing evidence can be offered on the benefits.
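As a sketch of the trade-off, assume hypothetical Orders and Customer tables that are usually retrieved together:

    -- Normalized: retrieving an order with its customer's name requires a join.
    SELECT o.OrderID, o.OrderDate, c.Name
    FROM   Orders o
    JOIN   Customer c ON c.CustomerID = o.CustomerID;

    -- De-normalized: the name is copied into Orders, pre-joining the data.
    -- Retrieval becomes a single-table read, but every change to a customer's
    -- name must now be propagated to all of that customer's order rows.
    CREATE TABLE Orders (
        OrderID      INTEGER PRIMARY KEY,
        OrderDate    DATE,
        CustomerID   VARCHAR(20),
        CustomerName VARCHAR(100)   -- de-normalized copy from Customer
    );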

In some cases, optimizing the data model may unmask problems in the Design Model: performance bottlenecks, poor modeling, or incomplete designs. In this event, discuss the problems with the Developer of the class, triggering change requests where appropriate.

      Optimize Data Access

      Once the table structure has been designed, determine the types of queries that will be performed against the data. Indexing is used by the database to speed access. The following indexing strategies should be considered:

      ·         The primary key column of the table should always be indexed. Primary key columns are used frequently as search keys and for join operations.

·         Tables smaller than about 100 rows benefit little from indexing, because they tend to remain in the database's in-memory cache. (The actual threshold depends on how many rows fit in a data block, which is typically between 512 and 2048 bytes in size; viewed from the physical perspective, the unit of I/O a database performs is the block, not the row.)

·         Indexes should also be defined for frequently executed queries, and for queries which must retrieve data quickly (generally, any search performed while a person may be waiting). An index should be defined for each set of attributes that are used together as search criteria. For example, if the system needs to find all Orders on which a particular product is ordered, there would need to be an index on the Line Item table's product number column (see the sketch after this list).

      ·         Indexes should generally be defined only on columns used as identifiers, not on numeric values (things like account balances) or textual information (such as order comments). Identifier column values tend to be assigned when the object is created and then remain unchanged for the life of the object.

      ·         Indexes should be on simple numbers (integer and number data types) rather than floating point numbers, and should rarely be on strings. Comparison operations are much simpler and faster on numbers than they are on strings, and given the large data volumes processed on a query or a large join, small savings add up quickly. Indexes on numeric columns tend to take significantly less space as well.
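Following the Line Item example above, a minimal sketch (table and column names are assumptions):

    -- Supports "find all orders on which a particular product is ordered".
    CREATE INDEX idx_lineitem_product
        ON LineItem (ProductNumber);

    -- A composite index for attributes that are always searched together.
    CREATE INDEX idx_lineitem_order_product
        ON LineItem (OrderID, ProductNumber);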

On the downside, the use of indexes is not free; the more indexes on a table, the longer inserts and updates take to process. The following precautions should temper the use of indexes:

      ·         Don't index just to speed up an infrequently executed query, unless the query occurs at a critical point and maximum speed is essential.

·         In some systems, update and insertion performance is more important than query performance. A common example is in factory data-acquisition applications, where quality data is captured in real time. In these systems there are only occasional online queries, and most of the data is analyzed periodically by batch reporting applications that perform statistical analysis on it. For data-acquisition systems, remove all indexes to achieve maximum throughput. If indexes are needed, they can be rebuilt just before the batch reporting/analysis applications run, then dropped when the reporting/analysis is completed.

      ·         Keep in mind that indexes have a hidden cost: indexes cost time to update (a tax paid on every insert, update, or delete) and occupy disk space. Be sure you get value from using them.

      Many databases offer a choice of index types. The most common include:

·         B-tree indexes. The most frequently used kind is based on the balanced B-tree data structure. B-tree indexes are good when the index key values are randomly distributed and have wide variability. They tend to perform poorly when the data being indexed is already in sequential order.

      ·         Hashed indexes. Less frequently, index key values are hashed. Hashing offers better performance when the range of index key values is known, relatively unchanging, and unique. Hashing relies upon using the key value to calculate the address of the data of interest. Because of the need for predictability, hash indexes tend to be useful only for medium-sized look-up tables which change very infrequently.
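The syntax for choosing an index type is vendor-specific; a sketch in PostgreSQL's dialect (table and column names are assumptions):

    -- B-tree is the default index type in most databases.
    CREATE INDEX idx_customer_name ON Customer (Name);

    -- PostgreSQL allows the index type to be named explicitly; here, a hash
    -- index for a stable look-up table queried only by exact key match.
    CREATE INDEX idx_statecodes_code ON StateCodes USING HASH (Code);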

Choice of indexing strategy can have a large impact on performance, as can the timing of index creation. Bulk data loads should be performed without indexes (this can be achieved by dropping the index, loading the data, and then re-creating the index). The reason is that as each row is added, the index structure is re-balanced. Since subsequent rows will change the optimal index structure, the work done re-balancing the index as each row is inserted is largely wasted. It is faster and more efficient to load the data without indexes, then re-create the index when the load is done. Some databases provide bulk data loaders to do this automatically.
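A minimal sketch of the drop-load-recreate pattern (index and table names continue the assumptions above; a real bulk load would use the vendor's loader rather than plain inserts):

    -- Drop the index so the load does not re-balance the B-tree on every row.
    DROP INDEX idx_lineitem_product;

    -- Bulk-load the data (shown as plain inserts for illustration).
    INSERT INTO LineItem (OrderID, ProductNumber, Quantity) VALUES (1001, 'P-17', 3);
    -- ... many more rows ...

    -- Re-create the index once, after all the data is in place.
    CREATE INDEX idx_lineitem_product ON LineItem (ProductNumber);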

      Define Storage Characteristics

      Databases perform I/O not on rows or records, or even whole tables, but on disk blocks. The reason for this is simple: block I/O operations are usually optimized in software and hardware on the system. As a result, the physical organization of the tables and indexes in the database can have a dramatic impact on the performance of the system. There are several dimensions:

      ·          The density of information in the disk pages.

      ·          The location of disk pages on disk, and across disk drives.

      ·          The amount of disk space to allocate to the table.
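Physical storage clauses are highly vendor-specific; a sketch in an Oracle-style dialect (tablespace name and parameter values are illustrative assumptions):

    CREATE TABLE LineItem (
        OrderID       INTEGER,
        ProductNumber VARCHAR2(20),
        Quantity      INTEGER
    )
    PCTFREE 10              -- density: leave 10% of each block free for updates
    TABLESPACE fast_disk    -- location: place the table on a fast device
    STORAGE (INITIAL 10M);  -- allocation: pre-allocate space for expected volume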

      Define Reference Tables

      Often there are standard look-up tables, validation tables, or reference tables used throughout the project. Since the data in these tables tends to be frequently accessed but changes infrequently, it is worth special consideration. In the design model, these tables are things like standard product codes, state or province codes, postal or zip codes, tax tables, area code validation tables, or other frequently accessed information. In financial systems this might be lists of policy codes, insurance policy rating categories, or conversion rates. Look in the design model for classes which are primarily read, providing validation information for a large number of clients.

To tie in earlier steps: the data should be located on a fast drive, in relatively packed data pages. If the table is small, don't bother indexing it, since for small tables indexing actually adds overhead. A small, frequently accessed table will also tend to remain in memory, as least-recently-used (LRU) caching algorithms tend to keep such tables pegged in the data cache.

      Make sure the database cache is large enough to keep all reference tables in memory, if possible, plus normal "working set space" for queries and transactions. Often, the secret to increasing database performance is reducing I/O. Relatively speaking, memory is inexpensive, so if you need more, buy it.

Once the reference table structures are defined, determine a strategy for populating them. Since reference data is accessed early in the project, determining the reference values and loading the tables often needs to occur relatively early as well. While the Data Designer is not responsible for obtaining the data, she is responsible for determining how and when the reference tables will be populated and refreshed.
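A minimal sketch of a reference table and its population script (names and values are illustrative assumptions):

    CREATE TABLE StateCodes (
        Code VARCHAR(2)  PRIMARY KEY,
        Name VARCHAR(50) NOT NULL
    );

    -- Loaded once, early in the project; refreshed only when the codes change.
    INSERT INTO StateCodes (Code, Name) VALUES ('CA', 'California');
    INSERT INTO StateCodes (Code, Name) VALUES ('NY', 'New York');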

      Define Data and Referential Integrity Enforcement Rules

Data integrity rules, also known as constraints, ensure that data values lie within defined ranges. Where these ranges can be identified, the database can enforce them. (This is not to say that data validation should not be done in the application, only that the database can serve as a 'validator of last resort' in the event that the application does not work correctly.) Where data validation rules exist, define database constraints to enforce them.
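A hedged sketch of such constraints (the table, columns, and valid ranges are assumptions):

    CREATE TABLE Product (
        ProductNumber VARCHAR(20) PRIMARY KEY,
        UnitPrice     NUMERIC(10,2) NOT NULL,
        Status        VARCHAR(12),
        -- The database enforces the valid ranges even if application-level
        -- validation fails to run.
        CONSTRAINT chk_price_positive CHECK (UnitPrice > 0),
        CONSTRAINT chk_status_valid   CHECK (Status IN ('ACTIVE', 'DISCONTINUED'))
    );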

In addition, referential integrity rules can be enforced by the database. These ensure that for every foreign key value there is a corresponding primary key value in the referenced table. Foreign key constraint definitions are also often used by the query optimizer to accelerate query performance. Based on associations in the Design Model, create foreign key constraints on the associated tables in the Data Model. In many cases, the foreign key enforcement rules will use the reference tables discussed in the previous step.
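Continuing the sketch, a foreign key constraint tying the hypothetical Orders table back to Customer:

    ALTER TABLE Orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (CustomerID) REFERENCES Customer (CustomerID);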

      Distribute Class Behavior

Most databases support a stored procedure capability. A stored procedure is executable code which runs within the process space of the database management system and provides the ability to perform database-related actions on the server without transferring data across the network. Judicious use of stored procedures can improve the performance of the system.

Stored procedures usually come in two flavors: actual procedures and triggers. Procedures are executed explicitly by an application, generally take parameters, and can provide an explicit return value. Triggers are invoked implicitly when some database event occurs (a row is inserted, updated, or deleted), take no parameters (since they are invoked implicitly), and do not provide explicit return values.

      In database systems that lack constraints, triggers are often used to enforce referential and data integrity. Otherwise, they tend to be used when an event needs to trigger (or cause) another event.
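Trigger syntax varies by vendor; a sketch in an Oracle PL/SQL-style dialect (the tables and columns are assumptions):

    -- Fires implicitly on every insert; no parameters, no return value.
    CREATE OR REPLACE TRIGGER trg_order_audit
    AFTER INSERT ON Orders
    FOR EACH ROW
    BEGIN
        INSERT INTO OrderAudit (OrderID, CreatedAt)
        VALUES (:NEW.OrderID, SYSDATE);
    END;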

      The Design classes should be examined to see if they have operations which should be implemented using the stored procedure or trigger facility. Candidates include:

      ·         Any operations which primarily deal with persistent data (creating, updating, retrieving or deleting it).

      ·         Any operations in which a query is involved in a computation (such as calculating the average quantity and value of a product in inventory).

      ·         Operations which need to access the database to validate data.

      Remember that improving database performance usually means reducing I/O. As a result, if performing a computation on the DBMS server will reduce the amount of data passed over the network, the computation should probably be performed on the server.
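For instance, the inventory computation mentioned above can run on the server so that only the result crosses the network. A sketch in an Oracle PL/SQL-style dialect (procedure, table, and column names are assumptions):

    CREATE OR REPLACE PROCEDURE avg_product_quantity (
        p_product_number IN  VARCHAR2,
        p_avg_quantity   OUT NUMBER
    ) AS
    BEGIN
        -- The aggregate is computed inside the DBMS; a single number is
        -- returned instead of shipping every inventory row to the client.
        SELECT AVG(Quantity)
        INTO   p_avg_quantity
        FROM   Inventory
        WHERE  ProductNumber = p_product_number;
    END;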

Work with the Designer of the class to discuss how the database can be used to improve performance. The Designer will update the operation method to indicate that one or more stored procedures can or should be used to implement the operation.

      Review Design

      Using the Data Model Checklist below, assess the completeness and quality of the data model.

      Key Considerations

      The steps presented assume a relational data model. The steps for an object-oriented database are similar in nature, but differ in the details.

      More Information