Tips for Better Database Design
Over the years, working as a data modeler and database architect, I have noticed a number of rules that should be followed during data modeling and development. Here I describe some of these tips in the hope that they might help you. I have listed the tips in the order in which they occur during the project lifecycle, rather than by importance or by how common they are.
1. Plan Ahead
One problem that I have seen is when data modeling occurs at the same time as software development. This is like building the foundation before completing the blueprints. In the past, planning seemed an obvious step before starting development. Development teams would not create databases without planning, just as architects would not construct buildings without blueprints.
In application development, it is critical to understand the nature of the data.
Planning is often ignored so that developers can just “start coding”. The project starts and when issues come up, there is no slack in the schedule to address them. Developers take shortcuts with the intent to fix them later but this rarely if ever happens.
Careful planning is how to ensure that you end up with a proper database that is not hacked together. If you don’t spend the time and effort upfront addressing the data required by the processes, you’ll pay for it later with a database that must be reworked, replaced or scrapped.
Even if planning isn’t always done, many data modelers still follow these guidelines. That’s not to say we can predict every design need in advance, but most modelers believe that it’s worth the effort to understand the data and its usage. You would not want a design for transaction processing when the need is analytic report creation.
Times have changed; Agile methodologies are more prevalent so database teams must rethink their approach to data modeling. In Agile, the Domain Model from Use Cases is used instead of Entity Relationship Diagrams. However, the need for planning has not diminished. We need to understand the data and what it’s supposed to do. In general, the first few Sprints must focus on data design.
So it’s not Agile that is the issue for database modelers, but rather individuals who do not grasp the nature of data. Some see database development as the same as application development. Database modeling and software development are different and need appropriate focus.
The database is the core of most software applications. You must take the time to analyze the requirements and how the data model will meet them. This decreases the chance that the development will lose course and direction.
The developers must understand the importance of data and its contribution to the development process. We live in the information age. Applications display and manipulate data. It is the information contained in the data that gives meaning to the application.
It is not possible to foresee every requirement nor every issue, but it is important to prepare for problems by careful planning.
2. Document Your Model
When making a data model, everything seems obvious. You name the objects so that their purpose is evident and everyone will understand the meaning just by reading the name. This may be true now, but it isn’t as obvious as you might think. When choosing names for tables and columns, make it clear what the usage of each object will be. Over time, the meaning of objects will become unclear without documentation.
Using a naming convention is one step towards effective documentation. When you have to make changes in the future, you will appreciate any existing documentation. A short, simple document that describes the decisions you made and the resulting design will help explain the design choices made at the time.
You want enough documentation so that the database can be managed by a new administrator and they can understand the meaning without having to come back to you for explanation. If the data model and the environment are not documented, it is difficult to maintain or change it as requirements change.
To some extent, documentation has little to do with the data modeling. Documentation is about communicating the design and making it understandable in the future.
Documentation is often an afterthought. When schedules are short, documentation gets ignored. Yet, this is technical debt with a high cost. Cutting corners during the development cycle will accrue costs in the future for database changes, problem identification, tracking bugs and for understanding the data model and nature of the data.
As an example, data models often have an “ID” field as the primary key for a table, or as part of a key name. This might be a primary key like TransactionID on the Transaction table. If some tables use “Number” as part of a key name instead, it is good to document why. Perhaps ReferenceNumber is used as the name of the primary key on the Message table because that is what the reference is called in the business area. For example, in financial services, financial messages typically include a reference number.
Document the definition of tables, columns and relationships so that programmers can access the information. The documentation must describe expectations of the database structure.
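The examples above can be sketched in DDL. This is a minimal sketch assuming SQLite (chosen only because it ships with Python) and hypothetical column names; engines such as PostgreSQL could use COMMENT ON instead of the inline SQL comments shown here.

```python
import sqlite3

# A sketch of self-documenting DDL based on the Transaction and Message
# examples above. Table and column names beyond TransactionID and
# ReferenceNumber are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Transaction: one row per financial transaction.
    -- TransactionID follows the standard "ID" primary-key convention.
    CREATE TABLE "Transaction" (
        TransactionID INTEGER PRIMARY KEY,
        Amount        NUMERIC NOT NULL
    );

    -- Message: financial messages exchanged with counterparties.
    -- Documented exception to the "ID" convention: the key is called
    -- ReferenceNumber because that is the business term for it.
    CREATE TABLE Message (
        ReferenceNumber TEXT PRIMARY KEY,
        Body            TEXT NOT NULL
    );
""")
```

Keeping the commentary next to the DDL (or attached to the model itself) means the documentation travels with the schema rather than living in a separate file.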
In the Vertabelo tool, I can add comments to any item – tables, columns, references, alternate keys – which means the documentation is stored directly with my model rather than in some extra document that must be maintained separately.
3. Follow Conventions
Naming conventions might not appear important during the design. In reality, names provide the insight to understanding a model. They are an introduction and should be logical.
Inconsistent naming serves no purpose. It only frustrates developers who must access the data, administrators of the database, and modelers who must make changes in the future.
When “ID” is used for some artificial keys but other tables use a different naming convention (such as Number), developers, analysts, and DBAs may waste time trying to understand the exceptions. Weak naming conventions also lead to development errors, because the naming is not consistent.
Hand-in-hand with documentation, using a naming convention makes it easier for someone in the future to understand the model. Do not randomly switch between using “ID” (like CustomerID) and “Number” (AccountNumber) as the keys for tables. Only make exceptions to the conventions when they are justified, and document what the exception is and why the convention is not respected.
The same applies to cryptic names like “XRT1” – is that the extended reference table? Your guess is as good as mine. I hope that the designer knew why he chose such a cryptic name, but I doubt that the next person to access the database can guess the reason.
Naming conventions are a matter of personal choice. Make sure decisions are consistent and documented.
4. Think Carefully About Keys
Keys often generate controversy: primary keys, foreign keys, and artificial keys. Tables need a primary key that identifies each row. The art is to decide which columns should be part of the primary key and what values to include.
For proper normalization, each table needs an identifying key. Uniqueness must be guaranteed. Yet the natural key and the primary key don’t have to be the same column or columns; the primary key may be a different key, as long as the table still has a natural key.
Some data modelers prefer an artificial key for uniqueness, while others prefer a natural key to ensure data integrity.
So, should we use a natural key as the primary key? One challenge arises if the natural key must be changed: if it consists of many columns, the change may ripple through many places. The opposite pitfall is making an artificial key the only key for a table.
As an example, you might have a table storing information about products. The table may be defined with an artificial key such as a sequence, a code for the short alphabetic name for the product and the product definition. If uniqueness is ensured only by the artificial key, there may be two rows with the same product code. Are these the same product that is entered twice? Perhaps a key with the product code is more appropriate.
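The product example can be sketched as follows; SQLite and the column names (ProductID, ProductCode, Definition) are assumptions for illustration. The artificial key alone cannot stop a duplicate product code, so the natural key is enforced with a UNIQUE constraint alongside it.

```python
import sqlite3

# Product table with an artificial primary key (a sequence) plus a UNIQUE
# constraint on the natural key, the short alphabetic product code.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Product (
        ProductID   INTEGER PRIMARY KEY,   -- artificial key (auto sequence)
        ProductCode TEXT NOT NULL UNIQUE,  -- natural key: short product code
        Definition  TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO Product (ProductCode, Definition) VALUES ('WIDG', 'Widget')")

try:
    # The same code entered twice: with only the artificial key this would
    # silently become a second row; the UNIQUE constraint rejects it instead.
    conn.execute("INSERT INTO Product (ProductCode, Definition) VALUES ('WIDG', 'Widget, re-entered')")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False
```

With the UNIQUE constraint in place, the “same product entered twice” question can never arise in the data.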
5. Use Integrity Checks Carefully
To ensure data integrity, we need foreign keys and constraints. Be careful not to overuse or underuse these integrity checks.
Domain tables are effective for enforcing integrity. Domain tables work well when there are many values to be checked against, or when the values to be checked change frequently.
One risk is that developers decide to leave integrity checking to the application. The problem is that a central database might be accessed by many applications, each of which would have to implement the same checks. Also, you generally want to protect the data where it lives: in the database.
If the possible values are limited or fall in a range, then a check constraint may be preferable. Let’s say that messages are defined as either Incoming or Outgoing; in that case there is no need for a foreign key. But for something like valid currencies: while these may seem static, they actually change from time to time. Countries join a currency union and currencies change.
Applications should also perform integrity checks, but don’t rely only on the application for integrity checking. Defining integrity rules on the database ensures that those rules will never be violated. In this way, the data satisfies the defined integrity rules at all times.
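Both mechanisms can be sketched together; SQLite and the table names are assumptions here. Direction has only two fixed values, so a CHECK constraint is enough; valid currencies change over time, so a domain table with a foreign key is the better fit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this to enforce FKs
conn.executescript("""
    CREATE TABLE Currency (Code TEXT PRIMARY KEY);  -- domain table
    INSERT INTO Currency VALUES ('EUR'), ('USD');

    CREATE TABLE Message (
        ReferenceNumber TEXT PRIMARY KEY,
        Direction       TEXT NOT NULL CHECK (Direction IN ('I', 'O')),
        CurrencyCode    TEXT NOT NULL REFERENCES Currency (Code)
    );
""")

conn.execute("INSERT INTO Message VALUES ('REF001', 'I', 'EUR')")  # passes both rules

rejected = []
for row in (('REF002', 'X', 'EUR'),    # invalid direction -> CHECK violation
            ('REF003', 'O', 'XXX')):   # unknown currency  -> FK violation
    try:
        conn.execute("INSERT INTO Message VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError:
        rejected.append(row[0])

print(rejected)  # ['REF002', 'REF003']
```

Because the rules live in the database, every application that touches these tables gets the same protection for free.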
6. Don’t Forget Indexes in Your Design
Some indexing design is useful during database modeling, even if indexes may change during actual deployment and usage. Of course, it is possible to have too many indexes, just as it is possible to have too few.
Indexing is an ongoing process. During design, you start the process on your model, working on the primary keys and constraints.
Indexing revolves around query optimization. When modeling, consider how the data will be queried and index for those access paths, taking care not to over-index.
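A query-driven index can be sketched like this; SQLite and the Message table layout are assumptions for illustration. The index mirrors an expected query pattern (lookups by currency), and EXPLAIN QUERY PLAN shows whether the optimizer can use it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Message (
        ReferenceNumber TEXT PRIMARY KEY,
        CurrencyCode    TEXT NOT NULL
    )
""")
# Index chosen to match an anticipated query, not added speculatively.
conn.execute("CREATE INDEX idx_message_currency ON Message (CurrencyCode)")

# Ask the optimizer how it would run the expected query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM Message WHERE CurrencyCode = 'EUR'"
).fetchall()
print(plan[0][3])  # plan detail; a search via idx_message_currency is expected
```

Checking the plan at design time, and again as real query patterns emerge, is what makes indexing an ongoing process rather than a one-time decision.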
7. Avoid Common Lookup Tables
I have often seen a common lookup table for attribute pairs. Defining a single, generic domain table is perceived to simplify the design. This style of domain table makes an abstract definition for holding text. I have heard it called an “Allowed Value” or “Valid Values” table, but the term “MUCK” table was coined for this anti-pattern in 2006: Massively Unified Code-Key.
MUCK tables contain unrelated data.
For example, you could have a table that defines the domain, entry and a description. Thinking back to the message example above, two entries might be:
Domain | Entry | Description |
---|---|---|
1 | I | Incoming message received by the bank |
1 | O | Outgoing message sent by the bank |
Now add in entries for another domain:
Domain | Entry | Description |
---|---|---|
2 | COVER | Cover payment |
2 | SERIAL | Serial payment |
2 | SSI | Standard Settlement Instructions |
This is just a mess. What does the table mean?
Just for fun, I modeled a simple example of a MUCK table (or OTLT, “One True Lookup Table” if you are a Tolkien fan) and included some comments. Please note that this is an anti-pattern and I am not recommending that you use it in your data model.
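The MUCK table from the rows above can be sketched as follows, assuming SQLite and a hypothetical name for the table. It is reproduced only to illustrate the problem: nothing in the structure says what Domain 1 or 2 means, and unrelated message directions and payment methods share one table.

```python
import sqlite3

# Anti-pattern sketch: do NOT use this design. "AllowedValue" is a
# hypothetical name for the single generic lookup table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE AllowedValue (
        Domain      INTEGER NOT NULL,   -- magic number every consumer must know
        Entry       TEXT    NOT NULL,
        Description TEXT    NOT NULL,
        PRIMARY KEY (Domain, Entry)
    )
""")
conn.executemany(
    "INSERT INTO AllowedValue VALUES (?, ?, ?)",
    [(1, 'I',      'Incoming message received by the bank'),
     (1, 'O',      'Outgoing message sent by the bank'),
     (2, 'COVER',  'Cover payment'),
     (2, 'SERIAL', 'Serial payment'),
     (2, 'SSI',    'Standard Settlement Instructions')],
)

# A foreign key from, say, a Payment table cannot target only the Domain 2
# rows; callers must filter by the undocumented magic number instead.
rows = conn.execute(
    "SELECT Entry FROM AllowedValue WHERE Domain = 2 ORDER BY Entry"
).fetchall()
print([r[0] for r in rows])  # ['COVER', 'SERIAL', 'SSI']
```

Separate domain tables (one for message directions, one for payment methods) keep the meaning in the structure and allow real foreign keys.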
8. Define an Archiving Strategy
All too often, I have seen databases created without a proper strategy for data retention and archiving. How long will data be kept online, available in active database tables? Most systems are built to keep data in the database “forever”. For most systems, this is not a reasonable long-term retention strategy; at some point, active data should be archived.
One approach that I advocate is to include data retention as part of your design considerations. Will you have active and historical tables, so that inserts of new rows in the active tables remain fast, while searches on historical data can be optimized?
This avoids having to redesign archiving into your database on top of the original design.
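The active/historical split can be sketched like this, assuming SQLite and hypothetical table names. Rows older than a retention cutoff move from the small, fast active table into an archive table built for historical searches.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MessageActive  (ReferenceNumber TEXT PRIMARY KEY, SentOn TEXT NOT NULL);
    CREATE TABLE MessageArchive (ReferenceNumber TEXT PRIMARY KEY, SentOn TEXT NOT NULL);
    INSERT INTO MessageActive VALUES ('REF001', '2014-01-15'), ('REF002', '2015-06-01');
""")

cutoff = '2015-01-01'  # hypothetical retention boundary
with conn:  # copy and delete in one transaction so no row is lost or duplicated
    conn.execute(
        "INSERT INTO MessageArchive SELECT * FROM MessageActive WHERE SentOn < ?",
        (cutoff,))
    conn.execute("DELETE FROM MessageActive WHERE SentOn < ?", (cutoff,))

active = conn.execute("SELECT COUNT(*) FROM MessageActive").fetchone()[0]
archived = conn.execute("SELECT COUNT(*) FROM MessageArchive").fetchone()[0]
print(active, archived)  # 1 1
```

Designing both tables up front means the archive job is a routine transaction, not a schema redesign bolted on later.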
9. Test Early, Test Often
Testing is always a challenge in the development plan. There is often a test phase at the end of an Agile Sprint and system testing at the end of development. Testing is generally the first thing to be squeezed when time gets short.
In testing the database, the goal should be to simulate a production environment: “A Day in the Life of the Database”. What volumes can be expected? What user interactions are likely? Are the boundary cases being handled?
So a testing plan and proper testing must be an integral part of data modeling and database development.
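A “day in the life” test can be sketched along these lines; SQLite, the Message table layout, and the daily volume figure are all assumptions for illustration. The test loads a production-like volume and then probes a boundary case, the way a test plan would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Message (
        ReferenceNumber TEXT PRIMARY KEY,
        Direction       TEXT NOT NULL CHECK (Direction IN ('I', 'O'))
    )
""")

# Simulate one day's expected insert volume (hypothetical figure).
DAILY_VOLUME = 10_000
conn.executemany(
    "INSERT INTO Message VALUES (?, ?)",
    ((f"REF{n:06d}", 'I' if n % 2 else 'O') for n in range(DAILY_VOLUME)),
)

# Boundary case: the CHECK constraint must reject an invalid direction.
try:
    conn.execute("INSERT INTO Message VALUES ('REFBAD', 'X')")
    boundary_ok = False
except sqlite3.IntegrityError:
    boundary_ok = True

total = conn.execute("SELECT COUNT(*) FROM Message").fetchone()[0]
print(total, boundary_ok)  # 10000 True
```

Running such a script against each schema revision keeps volume and boundary behavior visible throughout development instead of only at the end.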