Dirsync Problem Summary and Update

 

Problem Summary

Updates

Technical Resources



Summary

Dirsync is a process present in all components of the iPlanet mail system that maintains a local copy of the enterprise LDAP directory. It accomplishes this by querying the LDAP directory for all users once a day, and querying for differential updates at regular intervals throughout the day.
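The two query modes described above (a daily full pull plus periodic differential updates) can be sketched as follows. This is only an illustration: the actual queries dirsync issues are not shown in this document, and the function names, the `(objectClass=*)` full-pull filter, and the use of the standard `modifyTimestamp` operational attribute for the differential pass are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def ldap_generalized_time(moment: datetime) -> str:
    """Format a timezone-aware datetime as LDAP GeneralizedTime in UTC."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%SZ")


def full_sync_filter() -> str:
    # Assumed filter: matches every entry, so the result set is the whole tree.
    return "(objectClass=*)"


def differential_filter(last_sync: datetime) -> str:
    # Assumed filter: only entries modified since the previous pass.
    return f"(modifyTimestamp>={ldap_generalized_time(last_sync)})"


def choose_filter(
    now: datetime,
    last_full: Optional[datetime],
    last_sync: Optional[datetime],
    full_interval: timedelta = timedelta(days=1),
) -> str:
    """Pick the full pull once per interval, the differential pull otherwise."""
    if last_full is None or last_sync is None:
        return full_sync_filter()
    if now - last_full >= full_interval:
        return full_sync_filter()
    return differential_filter(last_sync)


if __name__ == "__main__":
    now = datetime(2002, 1, 8, 12, 0, 0, tzinfo=timezone.utc)
    print(choose_filter(now, None, None))
    print(choose_filter(now, now - timedelta(hours=2), now - timedelta(minutes=10)))
```

The full-pull filter is the expensive case: it returns every entry, which is the kind of very large result set the rest of this summary is concerned with.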

When dirsync pulls the LDAP directory, it does so using a number of extremely broad and inefficient queries. These queries are saturating the disk channels on our LDAP servers, and are causing a number of other performance problems.

MST has been working with Sun/iPlanet to address what we thought was a problem with dirsync. After several consultations, we've concluded that dirsync is functioning as designed - it's just a terribly inefficient and resource-intensive process. The problem is therefore one of optimization: the LDAP servers must be tuned to support dirsync in addition to our normal application load.

MST is now working on mechanisms that can support these dirsync operations without the performance problems they're currently causing. This will involve optimizing the LDAP servers for a different class of queries (those with extremely large result sets), which runs counter to the types of queries we currently support. We therefore need to spend some time on testing and experimentation to find the right balance between optimizing for dirsync and for other applications.

Since dirsync introduces a new dynamic, we need to look at the types of optimizations we can perform through software and OS tuning. If software and OS tuning proves inadequate, we'll need to look at hardware or other configuration options. Testing is underway to determine what software and OS tuning can achieve, and how the results will relate to the purchase of two new servers for LDAP/MMP separation.


Updates

12/27/2001 - First cut of baseline data

Baseline data has been assembled, but it showed that there still appears to be a configuration difference between the ITE and production. A dirsync process appears to be running against the secondary ITE LDAP server, which is not the case in production. The differences between production and test must be ironed out before we can complete our baselining.

1/8/2002 - Baseline information complete

All of our baseline information is complete and proportional between production and test. We'll start with our first round of changes tonight.

1/8/2002 - Cache Changes

Two changes implemented:

  • Moved LDAP server cache to ramdisk
  • Tuned LDAP server cache down to 512k and EntryCache to 0, effectively disabling both of them.

LDAP server performance appears to have remained constant despite these changes (a good thing). The slapd process size is far smaller and disk I/O is greatly reduced.

We still have a proportionally high number of writes on Oin and Gloin, but this is likely because the cache changes relieved enough pressure on the machines that ordinary usage patterns are now showing up in the trend data (i.e., reads are low because there is no load on the ITE).
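The reasoning about proportionally high writes can be illustrated with a little arithmetic. The sketch below uses made-up operation counts (they are not measurements from Oin, Gloin, or any other machine) to show how the write share of total disk operations rises when read traffic is low, even if the number of writes stays the same.

```python
def write_share(reads: float, writes: float) -> float:
    """Return writes as a fraction of all disk operations (0.0 if idle)."""
    total = reads + writes
    if total == 0:
        return 0.0
    return writes / total


if __name__ == "__main__":
    # Hypothetical counts per sampling interval, chosen only for illustration.
    loaded = write_share(reads=900, writes=100)  # busy machine, many reads
    idle = write_share(reads=50, writes=100)     # same writes, few reads
    print(f"loaded write share: {loaded:.2f}")
    print(f"idle write share:   {idle:.2f}")
```

With the same absolute number of writes, the lightly loaded case shows a much larger write share, which is consistent with the explanation that the ITE simply has very little read load.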

1/10/2002 - Review

We haven't seen any negative side effects in the ITE, and machine stats still look good. Writes are still proportionally high, but we've verified that this is due to the lack of load in the ITE. The next step is to plan this change for production.

1/13/2002 - Cache Changes rolled to prod LDAP cluster

We moved our cache changes to the production LDAP cluster and restarted the LDAP processes. The machines seem to be much less loaded and disk I/O has gone down substantially. This is probably a clear enough indicator to let the replacement hardware order proceed.


Technical Resources

Machine Baseline Data

First Round of Changes (Cache Mods to ITE)

Prod Implementation (Cache Mods to prod LDAP cluster)