During normalization, a table is decomposed into several smaller tables. This ensures data is stored logically, eliminates redundancy, and avoids insertion, update, and deletion anomalies.
However, more tables mean more joins are required to answer a query. This is fine when tables hold only a few hundred records, but for tables containing thousands or even millions of rows, the extra joins can have a noticeable negative impact on performance.
Let's assume we have a fully normalized design as below:
Projects(ProjectId, ProjectName, CustomerId)
ProjectDetails(ProjectId, ItemId, ActivityId, EmployeeId, WorkDate, TimeSpent)
Customers(CustomerId, CustomerName)
Activities(ActivityId, ActivityName)
Employees(EmployeeId, EmployeeName)
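As a concrete sketch, this schema could be created as follows (SQLite via Python's stdlib; the column types and key constraints are assumptions, since the original schema lists only column names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers  (CustomerId INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE Activities (ActivityId INTEGER PRIMARY KEY, ActivityName TEXT);
CREATE TABLE Employees  (EmployeeId INTEGER PRIMARY KEY, EmployeeName TEXT);
CREATE TABLE Projects   (ProjectId  INTEGER PRIMARY KEY, ProjectName TEXT,
                         CustomerId INTEGER REFERENCES Customers(CustomerId));
-- Each detail row records time spent by one employee on one activity.
CREATE TABLE ProjectDetails (
    ProjectId  INTEGER REFERENCES Projects(ProjectId),
    ItemId     INTEGER,
    ActivityId INTEGER REFERENCES Activities(ActivityId),
    EmployeeId INTEGER REFERENCES Employees(EmployeeId),
    WorkDate   TEXT,
    TimeSpent  REAL,
    PRIMARY KEY (ProjectId, ItemId)
);
""")
```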
Suppose there is a performance problem with a report that calculates the total time spent on each project. The report is slow because it has to aggregate many ProjectDetails rows in order to compute the time spent for each activity on a project.
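The slow report boils down to a join-plus-aggregation such as the following (a sketch in SQLite via Python; the sample rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Projects (ProjectId INTEGER PRIMARY KEY, ProjectName TEXT, CustomerId INTEGER);
CREATE TABLE ProjectDetails (ProjectId INTEGER, ItemId INTEGER, ActivityId INTEGER,
                             EmployeeId INTEGER, WorkDate TEXT, TimeSpent REAL);
INSERT INTO Projects VALUES (1, 'Website', 10);
INSERT INTO ProjectDetails VALUES (1, 1, 100, 7, '2024-01-02', 3.5),
                                  (1, 2, 101, 7, '2024-01-03', 2.0);
""")

# Every run of the report must rescan and sum all matching detail rows.
row = conn.execute("""
    SELECT p.ProjectName, SUM(d.TimeSpent) AS TotalTimeSpent
    FROM Projects p
    JOIN ProjectDetails d ON d.ProjectId = p.ProjectId
    GROUP BY p.ProjectId, p.ProjectName
""").fetchone()
print(row)  # ('Website', 5.5)
```

With millions of detail rows, that SUM over the join is the cost the denormalized column is meant to avoid.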
In such a circumstance, performance can be improved by adding an extra column, TotalTimeSpent, to the Projects table.
Projects(ProjectId, ProjectName, CustomerId, TotalTimeSpent)
This additional column reintroduces the danger of update anomalies into the database. These anomalies can be managed by documenting the redundancy and ensuring the application maintains the denormalized data using transactions appropriately. For example, a transaction might require that every insert into ProjectDetails is followed by an update to the TotalTimeSpent column of the corresponding project; if either action fails, the entire transaction is rolled back to keep the data consistent. Even better, a database trigger can be used: the trigger fires automatically and updates the column whenever data is inserted into ProjectDetails.
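Here is a minimal sketch of such a trigger in SQLite (via Python); the table definitions, types, and trigger name are assumptions for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Projects (ProjectId INTEGER PRIMARY KEY, ProjectName TEXT,
                       CustomerId INTEGER, TotalTimeSpent REAL DEFAULT 0);
CREATE TABLE ProjectDetails (ProjectId INTEGER, ItemId INTEGER, ActivityId INTEGER,
                             EmployeeId INTEGER, WorkDate TEXT, TimeSpent REAL);

-- Keep the denormalized column in sync automatically on every insert.
CREATE TRIGGER trg_detail_insert AFTER INSERT ON ProjectDetails
BEGIN
    UPDATE Projects
    SET TotalTimeSpent = TotalTimeSpent + NEW.TimeSpent
    WHERE ProjectId = NEW.ProjectId;
END;

INSERT INTO Projects (ProjectId, ProjectName, CustomerId) VALUES (1, 'Website', 10);
INSERT INTO ProjectDetails VALUES (1, 1, 100, 7, '2024-01-02', 3.5);
INSERT INTO ProjectDetails VALUES (1, 2, 101, 7, '2024-01-03', 2.0);
""")

total = conn.execute("SELECT TotalTimeSpent FROM Projects WHERE ProjectId = 1").fetchone()[0]
print(total)  # 5.5
```

Note that a complete solution would also need matching triggers for UPDATE and DELETE on ProjectDetails, otherwise those operations would leave TotalTimeSpent stale.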
Suppose your company requires you to maintain the customer's name as of the time a project was started, and a customer can change their name. How would you meet this requirement?
The Customers table holds only the customer's current name. To maintain a historical record, the customer's name must also be added to the Projects table. It is also good practice to record the date on which that name was valid, so add the project's start date too. The denormalized database design now looks like this:
Projects(ProjectId, ProjectName, CustomerId, TotalTimeSpent, StartDate, CustomerName)
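One way to satisfy the requirement, sketched in SQLite via Python, is to copy the customer's current name into the project row when the project is created; the later rename then leaves the snapshot untouched (all names and dates here are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY, CustomerName TEXT);
CREATE TABLE Projects  (ProjectId INTEGER PRIMARY KEY, ProjectName TEXT,
                        CustomerId INTEGER, TotalTimeSpent REAL DEFAULT 0,
                        StartDate TEXT, CustomerName TEXT);

INSERT INTO Customers VALUES (10, 'Acme Ltd');

-- Snapshot the customer name that is valid on the project's start date.
INSERT INTO Projects (ProjectId, ProjectName, CustomerId, StartDate, CustomerName)
SELECT 1, 'Website', CustomerId, '2024-01-01', CustomerName
FROM Customers WHERE CustomerId = 10;

-- The customer later changes its name; the historic snapshot is unaffected.
UPDATE Customers SET CustomerName = 'Acme Global' WHERE CustomerId = 10;
""")

snapshot = conn.execute("SELECT CustomerName FROM Projects WHERE ProjectId = 1").fetchone()[0]
print(snapshot)  # Acme Ltd
```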
In summary, denormalization can improve performance because fewer joins are required, and it can also be useful for preserving historical information. However, it can lead to data inconsistency, so the redundant data must be maintained with appropriately scoped transactions or with database triggers.