Monday, April 13, 2009

Access vs SQL Server: some stats (part 1)

Microsoft Access In preparation for upsizing my current Access 2007 application, I have been trying to understand a bit more about the possible performance impact of various choices of Primary Keys.

My problem is that the Access application currently uses autoincrement numbers as surrogate Primary Keys (PK). Since I will need to synchronise the data over multiple remote sites, including occasionally disconnected clients, I can’t keep the current autoincrement PK and will need to switch to GUIDs.

To see for myself what could be the impact, I made a series of benchmarks.
This first part is fairly simple:

  • Populate a Product table that contains 3 fields (ID, SKU and Description) with 1,000,000 records.
  • Test natively on SQL Server and Access 2007.
  • The records are inserted in transaction batches of 1,000 records.
  • I collect the time taken for each of these transactions and plot it.

Test setup

Nothing much to say about that:

All tests are performed on a dedicated Windows Server 2008 x64 rack running Access 2007 and SQL Server 2008 Standard (SP1) x64.

Test database

In SQL Server, we create a database with two tables, ProductGUID and ProductINT:

CREATE TABLE ProductGUID (
    ID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID(),
    SKU NVARCHAR(32) NOT NULL,
    Description NVARCHAR(255) NULL
);
CREATE CLUSTERED INDEX ProdGUIDix ON ProductGUID(ID);
GO

CREATE TABLE ProductINT (
    ID INT IDENTITY(1,1),
    SKU NVARCHAR(32) NOT NULL,
    Description NVARCHAR(255) NULL
);
CREATE CLUSTERED INDEX ProdINTix ON ProductINT(ID);
GO

For the table using a GUID, we use NEWSEQUENTIALID() instead of NEWID() to create the new keys. This is supposed to offer much better performance as the generated GUIDs are guaranteed to be sequential rather than random, resulting in better index performance on insertion.

For the Access version of the tables, we basically use the same definitions, except that we use 4 tables:

  • ProductINT: let Jet/ACE autonumbering create the sequential integer Primary Key.
  • ProductINTRandom: let Jet/ACE autonumbering create the random integer Primary Key.
  • ProductGUIDRandom: let Jet/ACE use its own internal GenGUID() for the key which generates random GUIDs instead of sequential ones.
  • ProductGUIDSequential: call the Windows API (UuidCreateSequential) to create sequential GUIDs instead.
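
The article presumably creates these tables through the Access table designer; purely as an illustration, here is a minimal DAO sketch (my assumption, not the article’s code) of how the random-GUID table could be defined, with Jet/ACE’s GenGUID() as the default value so the engine fills the key on insert:

' Sketch only: create the ProductGUIDRandom test table from VBA/DAO.
Public Sub CreateProductGUIDRandomTable()
    Dim db As DAO.Database
    Dim td As DAO.TableDef
    Dim fld As DAO.Field
    Dim idx As DAO.Index

    Set db = CurrentDb
    Set td = db.CreateTableDef("ProductGUIDRandom")

    Set fld = td.CreateField("ID", dbGUID)
    fld.DefaultValue = "GenGUID()"              ' Jet/ACE generates a random GUID on insert
    td.Fields.Append fld
    td.Fields.Append td.CreateField("SKU", dbText, 32)
    td.Fields.Append td.CreateField("Description", dbText, 255)
    db.TableDefs.Append td

    ' Index the key, mirroring the clustered indexes used on SQL Server
    Set idx = td.CreateIndex("PrimaryKey")
    idx.Primary = True
    idx.Fields.Append idx.CreateField("ID")
    td.Indexes.Append idx
End Sub

The sequential-GUID table is identical except that it has no default value for ID: the key is supplied explicitly by the VBA test code shown further down.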

SQL Server Test code

Using the SQL Server Management Studio, we performed the following test once for each table (resetting the database in-between):

SET NOCOUNT ON;
GO

DECLARE @i INT = 1;
WHILE (@i <= 1000)
BEGIN
    DECLARE @tstart DATETIME2 = SYSDATETIME();
    BEGIN TRAN
        DECLARE @a INT = 1;
        WHILE (@a <= 1000)
        BEGIN
            -- VARCHAR(10) avoids CONVERT's default CHAR(30) padding, which would overflow the NVARCHAR(32) SKU
            INSERT INTO ProductGUID (SKU,Description)
            VALUES ('PROD' + CONVERT(VARCHAR(10),@a), 'Product number ' + CONVERT(VARCHAR(10),@a));
            SELECT @a = @a + 1;
        END;
    COMMIT TRAN;
    SELECT DATEDIFF(MILLISECOND, @tstart, SYSDATETIME()) AS timespan;
    SELECT @i = @i + 1;
END;
GO

Basically, we perform 1000 transactions each inserting 1000 records into the table ProductGUID or ProductINT.

Access 2007 Test code

To duplicate the same conditions, the following VBA code will perform 1000 transactions each inserting 1000 records.
Note that the recordset is opened in Append mode only.
The importance of this will be discussed in another article.

' Run this to insert 1,000,000 products in batches of 1000
' In the given table
Public Sub Benchmark(TableName As String, InsertSeqGUID As Boolean)
    Dim i As Integer
    For i = 1 To 1000
        Insert1000Products TableName, InsertSeqGUID 
    Next i
End Sub

' Insert 1000 products in a table
Public Sub Insert1000Products(TableName As String, InsertSeqGUID As Boolean)
    Dim i As Long
    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Dim ws As DAO.Workspace
    Dim starttime As Long
    Dim timespan As Long

    Set ws = DBEngine.Workspaces(0)
    DoEvents
    starttime = GetClock ' Get the current time in ms
    ws.BeginTrans
    Set db = CurrentDb
    Set rs = db.OpenRecordset(TableName, dbOpenDynaset, dbAppendOnly)
    With rs
        For i = 1 To 1000
            .AddNew
                If InsertSeqGUID Then !ID = "{guid {" & CreateStringUUIDSeq() & "}}"
                !SKU = "PROD" & i
                !Description = "Product number " & i
            .Update
        Next i
    End With
    ws.CommitTrans
    rs.Close
    timespan = GetClock() - CDbl(starttime)
    Set rs = Nothing
    Set db = Nothing
    ' Print Elapsed time in milliseconds
    Debug.Print timespan
    DoEvents
End Sub
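
The two helpers used above, GetClock and CreateStringUUIDSeq, are not listed in the article. Here is a minimal sketch of how they might be implemented (my assumption), using the winmm.dll millisecond timer and the rpcrt4.dll UuidCreateSequential API (32-bit Declare statements, as used by Access 2007):

' Sketch only: module-level declarations for the two helpers.
Private Type GUID_TYPE
    Data1 As Long
    Data2 As Integer
    Data3 As Integer
    Data4(7) As Byte
End Type

Private Declare Function UuidCreateSequential Lib "rpcrt4.dll" (guid As GUID_TYPE) As Long
Private Declare Function StringFromGUID2 Lib "ole32.dll" (guid As GUID_TYPE, ByVal lpStrGuid As Long, ByVal cchMax As Long) As Long
Private Declare Function timeGetTime Lib "winmm.dll" () As Long

' Returns a new sequential UUID as a plain string (no braces)
Public Function CreateStringUUIDSeq() As String
    Dim guid As GUID_TYPE
    Dim buffer As String
    UuidCreateSequential guid
    buffer = String$(40, vbNullChar)
    StringFromGUID2 guid, StrPtr(buffer), Len(buffer)
    ' StringFromGUID2 fills the buffer with "{xxxxxxxx-...}": keep what's between the braces
    CreateStringUUIDSeq = Mid$(buffer, 2, InStr(buffer, "}") - 2)
End Function

' Returns the current time in milliseconds
Public Function GetClock() As Long
    GetClock = timeGetTime()
End Function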

We call this code to perform inserts on each of our Access tables:

  • ProductINT table: we just insert data in the ProductINT table, letting Access create autonumber IDs.
  • ProductINTRandom table: we just insert data in the ProductINTRandom table, letting Access create random autonumber IDs.
  • ProductGUIDRandom table: we let Access create the Random GUID for the primary key.
  • ProductGUIDSequential table: we use the Windows API to create a sequential GUID that we insert ourselves (the driver for these four runs is sketched below).
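
For reference, the driver for the four runs might simply look like this (a sketch; the table names are the ones listed above):

' Sketch only: run the benchmark against each of the four tables.
Public Sub RunAllBenchmarks()
    Benchmark "ProductINT", False            ' sequential autonumber
    Benchmark "ProductINTRandom", False      ' random autonumber
    Benchmark "ProductGUIDRandom", False     ' random GUID generated by Jet/ACE
    Benchmark "ProductGUIDSequential", True  ' sequential GUID from the Windows API
End Sub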

Test results

Without further ado, here are the raw results, showing the number of inserted records per second that we achieve for each test as the database grows (only the tests comparing Sequential GUIDs and Autoincrement on SQL Server and Access are shown here; see the next sections for the other results):

Inserts per second

What we clearly see here is that performance when using autoincrement and Sequential GUID stays pretty much constant over the whole test.
That’s good news as it means that using Sequential GUIDs does not degrade performance over time.

As a side note, in this particular test, Access offers much better raw performance than SQL Server. In more complex scenarios it’s very likely that Access’ performance would degrade more than SQL Server’s, but it’s nice to see that Access isn’t a sloth.

Using Sequential GUID vs Autoincrement in Access

The results show that we do take a performance hit of about 30% when inserting Sequential GUIDs versus just using autonumbers.
We’re still getting good results, but that’s something to keep in mind.

In terms of CPU consumption, here is what we get:

CPU load Access

Random PKs, whether they are simple integers or GUIDs, do consume substantially more CPU resources.

Using Sequential GUID vs Identity in SQL Server

Out of the box, SQL Server performs quite well and there is not much difference whether you’re using Sequential GUIDs or an autoincrement PK.

There is however a surprising result: using Sequential GUIDs is actually slightly faster than using autoincrement!

There is obviously an explanation for this but I’m not sure what it is so please enlighten me 🙂

CPU Consumption:

CPU load SQL Server

Using Random GUID vs Sequential GUID vs Random Autonumber in Access

So, what is the impact of choosing a Sequential GUID as opposed to letting Access create its own random GUIDs?

Inserts per second Random GUID vs Sequential GUID in Access

It’s clear that random GUIDs have a substantial performance impact: their randomness basically messes up indexing, resulting in the database engine having to do a lot more work to re-index the data on each insertion.
The good thing is that this degradation is pretty logarithmic so while it degrades over time, the overall performance remains pretty decent.
While GUIDs are larger than integers (16 bytes vs 4 bytes), the actual performance of inserting records whose PK is a random integer is actually worse than with a random GUID…

Provisional conclusions

Here we’ve established the baseline for our performance tests. In the next article, we’ll look exclusively at the performance of inserting data from a remote Access 2007 front end using our VBA code.

Having this baseline will allow us to check the performance overhead of using ODBC and letting Jet/ACE manage the dialogue with the SQL Server backend.

Feel free to leave your comments below, especially if you have any resources or information that would be useful.

Updates

  • 16APR2009: added test of random autonumber as PK in Access.
  • 13APR2009: Original Article

15 comments   |   Filed under :  Database,MSAccess,SQL Server

Sunday, April 12, 2009

Sysadmin: SQL server performance madness

Technology I’ve just lost 2 days going completely bananas over a performance issue that I could not explain.

I’ve got this Dell R300 rack server that runs Windows Server 2008 that I dedicate to running IIS and SQL Server 2008, mostly for development purposes.

Dell PowerEdge R300 Rack servers

In my previous blog entry, I was running some benchmarks to compare the performance of Access and SQL Server using INT and GUID keys and getting some strange results.

Here are the results I was getting from inserting large amounts of data in SQL Server:

Machine       Operating System          Test without Transaction    Test with Transaction
MacbookPro    Windows Server 2008 x64   324 ms                      22 ms
Desktop       Windows XP                172 ms                      47 ms
Server        Windows Server 2008 x64   8635 ms!!                   27 ms

On the server, not using transactions makes the query take more than 8 seconds, at least an order of magnitude slower than it should!

I initially thought there was something wrong with my server setup but since I couldn’t find anything, I just spent the day re-installing the OS and SQL Server, applying all patches and updates, so the server was basically brand new: nothing else on the box, no other services, basically all the power left for SQL Server…

Despair

When I saw the results for the first time, after spending my Easter Sunday rebuilding the machine, I felt dread and despair.
The gods were being unfair: it had to be a hardware issue, and it had to be related to either the memory or the hard disk. I couldn’t really understand why, but these were the only things I could see having such an impact on performance.

I started to look in the hardware settings:

Device Manager

And then I noticed this in the Policies tab of the Disk Device Properties :

DISK Device Properties

Just for the lulz of it, I ticked the box and closed the properties:

Enable advanced performance

And then tried my query again:

Machine       Operating System          Test without Transaction    Test with Transaction
Server        Windows Server 2008 x64   254 ms!!                    27 ms

A nearly 35-fold increase in performance!

Moral of the story

If you are getting strange and inconsistent performance results from SQL Server, make sure you check that Enable advanced performance option.
Even if you’re not getting strange results, you may simply not be aware of the issue: some operations may be much slower than they should be.

Before taking your machine apart and re-installing everything on it, check your hardware settings: there may be options made available by the manufacturer or the OS that you’re not aware of…

Lesson learnt.

Add comment   |   Filed under :  Database,Hardware,SQL Server,sysadmin

Wednesday, April 1, 2009

Access: building ‘upsizable’ applications.

Microsoft Access When you start building an Access application, it’s tempting to just think about today’s problem and not worry at all about the future.
If your application is successful, people will want more out of it and, over time, you’ll be faced with the task of moving the back-end database to a more robust system like SQL Server.

While there are tools like SSMA that can help you move an Access database to SQL Server, a lot of the problems you’ll encounter can be solved before you even have to think about upsizing.
Abiding by a few simple rules will cost you nothing when creating your Access application but will save you a lot of headaches if -when- the time comes to upsize.

So here are a few things to keep in mind.

Naming conventions

Access is pretty liberal about naming conventions and it will let you freely name your tables, columns, indexes and queries. When these get moved to another database you’ll most probably be faced with having to rename them.
In some cases, you could actually create subtle bugs because something that used to work fine in Access may be tolerated in the new database but be interpreted differently.

  • Do not use spaces or special characters in your data object names.
    Stick to characters in the range A through Z, 0 to 9 with maybe underscores _ somewhere in between (but not at the start or the end).
    Also try to respect casing wherever you reference this name (especially for databases like MySQL, which are case-sensitive if hosted on a Linux platform for instance).
    eg:
    Customer Order Lines (archive) should be CustomerOrderLines_Archive.
    Query for last Year's Turnover should be QueryLastYearTurnover.
    Index ID+OrderDate should become instead ID_OrderDate.

  • Do not use keywords that are reserved or might mean something else, whether they are SQL keywords or function names:
    A column called Date could be renamed PurchaseDate for instance.
    Similarly, OrderBy could be renamed SortBy or PurchaseBy instead, depending on the context of Order.
    Failing to do so may not generate errors but could result in weird and difficult to debug behaviour.

  • Do not prefix tables with Sys, USys, MSys or a tilde ~.
    Access has its own internal system tables starting with these prefixes and it’s best to stay away from these.
    When a table is deleted, Access will often keep it around temporarily and it will have a tilde as its prefix.

  • Do not prefix Queries with a tilde ~.
    Access uses the tilde to prefix the hidden queries it keeps internally as record sources for controls and forms.

Database design

  • Always use Primary keys.
    Always have a non-null primary key column in every table.
    All my tables have an autonumber column called ID. Using an automatically generated column ID guarantees that each record in a table can be uniquely identified.
    It’s a painless way to ensure a minimum level of data integrity.

  • Do not use complex multivalue columns.
    Access 2007 introduced complex columns that can record multiple values.
    They are in fact fields that return whole recordset objects instead of simple scalar values. Of course, this being an Access 2007 only feature, it’s not compatible with any other database. Just don’t use it, however tempting and convenient it might be.
    Instead, use a junction table to record Many-To-Many relationships between 2 tables, or use a simple lookup to record lists of choices in a text field itself if you’re only dealing with a very limited range of multivalues that do not change.

  • Do not use the Hyperlink data type.
    Another Access exclusive that isn’t available in other databases.

  • Be careful about field lookups.
    When you create Table columns, Access allows you to define lookup values from other tables or lists of values.
    If you manually input a list of values to be presented to the user, these won’t get transferred when upsizing to SQL Server.
    To avoid having to maintain these lookup lists all over your app, you could create small tables for them and use them as lookup instead; that way you only need to maintain a single list of lookup values.

  • Be careful about your dates.
    Access’s date range is much larger than SQL Server’s.
    This has 2 side-effects:
    1) if your software has to deal with dates outside that range, you’ll end up with errors.
    2) if your users enter dates manually, they could have made mistakes when entering the year (like 09 instead of 2009).
    Ensure that user-entered dates are valid for your application (see the sketch below).
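
As an illustration, here is a minimal validation sketch (my own, not from the article) that rejects dates outside the range accepted by SQL Server’s DATETIME type; you could call it from a form’s BeforeUpdate event and cancel the update when it returns False:

    ' Minimal sketch: validate a user-entered date before it reaches the backend.
    ' SQL Server's DATETIME accepts 1 Jan 1753 through 31 Dec 9999,
    ' while Access happily stores years from 100 onwards.
    Public Function IsValidSqlServerDate(ByVal d As Date) As Boolean
        IsValidSqlServerDate = (d >= #1/1/1753#) And (d <= #12/31/9999#)
    End Function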

VBA

While most of your code will work fine, there are a few traps that will bomb your application or result in weird errors:

  • Always explicitly specify options when opening recordsets or executing SQL.
    With SQL Server, the dbSeeChanges option is mandatory whenever you open a recordset for update.
    I recommend using dbFailOnError as well as it will ensure that the changes are rolled back if an error occurs.

    Dim db As DAO.Database
    Dim rs As DAO.Recordset
    Set db = CurrentDb
    ' Open for read/write
    Set rs = db.OpenRecordset("Stock", dbOpenDynaset, dbSeeChanges + dbFailOnError)
    ' Open for read only
    Set rs = db.OpenRecordset("Stock", dbOpenSnapshot)
    ' Direct SQL execution
    CurrentDb.Execute "INSERT INTO ...", dbSeeChanges + dbFailOnError
    
  • Get the new autonumbered ID after updating the record.
    In Access, autonumbered fields are set as soon as the record is added even if it hasn’t been saved yet.
    That doesn’t work for SQL Server as autonumbered IDs are only visible after the records have been saved.

    ' Works for Access tables only
    ' We can get the new autonumber ID as soon as the record is inserted
    rs.AddNew
    mynewid = rs!ID
    ...
    rs.Update
    
    ' Works for ODBC and Access tables alike
    ' We get the new autonumber ID after the record has been updated
    rs.AddNew
    ...
    rs.Update
    rs.Move 0, rs.LastModified
    mynewid = rs!ID
    
  • Never rely on the type of your primary key.
    This is more of a recommendation, but if you use an autonumbered ID as your primary key, don’t rely, in your code or your queries, on the fact that it is a long integer.
    This can become important if you ever need to upsize to a replicated database and need to transform your numeric IDs into GUIDs.
    Just use a Variant instead, as in the sketch below.
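
For instance, building on the snippet above (a sketch, not code from the article):

    ' Declaring the key variable as a Variant keeps the code identical
    ' whether ID is a Long autonumber today or a GUID after upsizing.
    Dim mynewid As Variant

    rs.AddNew
    ' ... set the other fields ...
    rs.Update
    rs.Move 0, rs.LastModified
    mynewid = rs!ID    ' works for both Long and GUID primary keys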

Parting thoughts

These simple rules will not solve all your problems but they will certainly reduce the number of issues you’ll be faced with when upsizing your Access application.
Using a tool like SSMA to upsize will then be fairly painless.

If you have other recommendations, please don’t hesitate to leave them in the comments; I’ll regularly update this article to include them.

1 comment   |   Filed under :  Database,MSAccess,Programming,SQL Server

Tuesday, March 17, 2009

MS Access: upsizing to SQL Server 2008

Microsoft Access I’m currently researching ways to move my main MS Access application from a simple local network client/backend setup to a global, multiple remote sites configuration using SQL Server.

One of the challenges is to upsize the current MS Access accdb backend database to SQL Server 2008. If you try it from Access itself using the Upsizing Wizard, you may end up getting this error message:

The Upsizing Wizard only works with Microsoft SQL Server (Versions 6.50 SP5 or higher). Please log in to a SQL Server data source.

After spending some time fiddling around with SQL Server settings I couldn’t understand why I was still getting this error.
Turns out that the upsizing wizard is apparently sensitive to the version of SQL Server you’re using and it doesn’t consider SQL Server v10 (2008) as being later than v6.50…

This issue is in fact a blessing.
Microsoft provides a migration tool for upsizing MS Access databases to SQL Server 2008 that’s orders of magnitude better than anything the basic wizard can do: the SQL Server Migration Assistant for Access, or SSMA.

Migrate from Access to SQL Server

SSMA lets you take a bunch of Access databases and move the tables and queries you choose to SQL Server, automatically linking them in your original database if you want.
It’s not just a one-off thing either: SSMA keeps track of the objects that were transferred and allows you to synchronise both schema and data as often as you need.

So here you are: do not use the basic MS Access Upsizing Wizard, download and use SSMA instead.

Strange COM Error

While SSMA works perfectly fine on my Windows 2008 x64 laptop, on my main Windows XP desktop it throws an exception when trying to load an Access database:

Unable to cast COM object of type ‘Microsoft.Office.Interop.Access.Dao.DBEngineClass’ to interface type ‘Microsoft.Office.Interop.Access.Dao._DBEngine’
… {00000021-0000-0010-8000-00AA006D2EA4}…

It was a COM error saying that the library for DAO couldn’t be loaded.

I couldn’t find any relevant information on the web.
After a while, I had a look at the DAO driver in
C:\Program Files\Common Files\Microsoft Shared\DAO\dao360.dll
and I noticed that the filename was in blue: that reminded me that I had set compression on the filesystem.

I disabled compression for the file and, magically, SSMA worked again…

Moral of the story: be careful about compressing your filesystem, some registered libraries and system files may work in unpredictable ways…

Updates

  • 23FEB2012: Added link to SSMA Team blog and updated download link to current version (5.2)
  • 17MAR2009: Added section on the strange COM error.
  • 14MAR2009: Added links to Tony’s Access MVP website.
  • 05JAN2009: Original publication.

30 comments   |   Filed under :  Database,MSAccess,Programming

Wednesday, March 11, 2009

Technical ebooks: some publishers don’t get it

I like buying technical books.
Unfortunately, here in Hong Kong, we have less choice: there are not that many technical bookstores that cater for English-language books and the selection is rather limited.

So whenever a book isn’t available here, I buy it online as a PDF.
It’s cheaper, saves postage and I can keep them handy on my desktop computer at work and my laptop.

I love Safari and being able to access such a large library of books online in such a flexible way is great, but if you’re not in the US, the experience is not always that pleasant, with the site sometimes becoming a bit too slow for comfort.

GraphitiTek by Charles Kalpakian

The publishers I regularly buy ebooks from are O’Reilly, when they make the PDF available, and Apress.

O’Reilly’s PDFs may have your subscription details embedded at the bottom of each page.
It’s perfectly reasonable and doesn’t ruin the experience: I can still search, copy snippets to the clipboard and print the book if I need to keep an annotated copy.

Apress encrypts the PDF with your account email. Again, that’s fine by me: they don’t prevent me from using the book and it’s not a great annoyance to have to type the password to unlock it.

Now comes Wrox (Wiley): they publish fine books and even have some available as ebooks.
The biggest issue I have though is that they assume that people who buy their books are all potential criminals:

  • The book is as expensive in paper as it is in ebook format. That can’t be right: ebooks have zero reproduction cost while paper books have huge material costs.
  • The ebook version needs to be registered with Adobe Digital Editions that locks it to a particular machine.
  • You’re only allowed to download the ebook to 4 different PCs and you’ve got 2 weeks to do so.
    This seems fair, but it’s not: if I change OS or PC, I’ve already burnt 2 licenses.
  • You can’t copy/paste more than a page at a time, up to 10 pages every week… that’s just a bit silly.
  • Can’t copy any of the artwork, diagrams, etc.
  • Doesn’t say anything about what happens if Adobe drops their particular DRM software or if I need to use the book on the next laptop I’ll buy a year from now.
  • Adobe Digital Editions only supports Windows, Mac and a Sony reader. So using Linux (even though Wrox publishes books about it) or a Kindle or any other mobile device is out of the question.

So the net advantage of buying an eBook from Wrox (Wiley) is: your guess is as good as mine.

Yeah, you can buy each chapter as a PDF: great value; at US$4.99 per chapter, the book ends up costing you nearly US$100. You can get the fully printed version for half the cost…
Still, I’ll concede that being able to download a particular chapter can be useful.

The bottom line is: if your book gets printed, it’s going to be pirated and distributed for free in ebook format within a couple of weeks of being published.
While thinking they are protecting their copyright, Wiley is in fact punishing people who buy their books.

I’ll stick with Apress and O’Reilly I think. At least I don’t feel treated like a thief.


12FEB2010 Update: Wrox is now trying DRM-free PDF ebooks. We can only applaud this decision. While the offering is still limited at the moment, most or all of their books will eventually be available DRM-free.


Bookshelf by Charles Kalpakian

3 comments   |   Filed under :  Books,Business,Reviews

Wednesday, February 4, 2009

Sysadmin: Multiple ISP firewall – The setup

Linux After suffering broadband trouble for the past 9 months, including interruptions that lasted a few days, I decided to get an additional line installed by a different ISP.
I could have bought one of these multi-WAN devices but decided against it for a couple of reasons: I like a challenge and I wanted to achieve a particular setup that I wasn’t sure could be answered by off-the-shelf products (for a reasonable price that is).

This long article is fairly detailed but if your setup is similar it should be enough to get you going quickly.

The basic setup

Without further ado, this is the network configuration:

Network Diagram

Notable things

We have 2 broadband connections:

  • CYBR, a standard DSL line with a fixed IP 111.99.88.77 allocated through PPPoE.
  • HKBN, a standard 100Mbps line with a fixed IP 30.40.50.62.

The network is split into different zones:

  • the Internet zone, connected to our Firewall through interfaces eth0 (ppp0) and eth1.
  • a Firewall zone, delimited by the firewall system itself
  • a DMZ zone connected through interface eth2 for the servers we want to make visible from the Internet.
    The DMZ has its own private subnet delimited by 192.168.254.0/255.255.255.0.
  • a LAN zone connected through interface eth3 so local computers can access the Internet and be protected from it.
    The LAN has its own private subnet delimited by 192.168.0.0/255.255.255.0.

Objectives

What we want from our setup:

  1. our firewall protects our DMZ and LAN from unwanted access.
  2. our win server can host websites or other services.
  3. our linux server can handle receiving and sending email or other services.
  4. our firewall can handle incoming traffic from either ISP.
  5. our firewall can load-balance local outgoing traffic across both ISPs.
  6. If one line fails, incoming traffic switches to the other line.
  7. If one line fails, outgoing traffic switches to the other line.
  8. Eventually, we want both the linux and win servers to be able to host different websites and we want the firewall to send incoming requests to the right server.

In this first article, I’ll present the setup for items 1-5.
The remaining topics will be the subject of subsequent articles of their own.

Technologies

The firewall is our primary subject. What is being discussed here is pretty much distribution-independent and should work on all flavours of Linux.

OS on the firewall system

I chose CentOS for the firewall.
Since CentOS is an almost byte-for-byte identical copy of RedHat Enterprise Linux, the configuration will be identical on RedHat and its derivatives such as Fedora.

Firewall software, first try

When my firewall needs are simpler, I use the Stronger IP Firewall Ruleset from the Linux IP Masquerade HOWTO.
I started to modify the script to adapt it to my new Multi-ISP setup but things got complicated once I needed to debug routing tables.
I got it 80% of the way but tracing network connections and packet routing is complicated and time-consuming.
After a couple of days of peering into log files and wireshark capture screens, I gave up manual configuration and decided to go with something else.

Firewall software, final

The product I chose in the end is shorewall: it’s a very flexible firewall system that creates the necessary iptables rules and configures most of the routing needed to properly handle complex network setups.
Shorewall is Open Source, very stable, has been out for a long time, is actively maintained and has lots of excellent documentation and examples.

Things to know

Before we get into the meat of the article, you should brush up on the following topics:

  • Linux system administration: you should know how to configure network connections, how to enable/disable/stop/start services, and how to edit config files.
  • Networking: you should know what a netmask, a gateway and a subnet are, and have a passing understanding of IP classes, IP notation, what ports are for, the difference between the tcp, udp and icmp protocols, what Destination NAT (DNAT) is, what Network Address Translation (NAT) is, and what masquerading means.
  • Some basic understanding of DNS and local host name resolving (using host.conf and resolv.conf)
  • Some basic knowledge of what routing is for and how it works.
  • Some knowledge of how the linux kernel handles network packets (NetFilter, basics of iptables).

You don’t need to be a specialist in any of these areas but any knowledge helps.
I’m far from being well versed in Netfilter and routing; it’s not often that I have to deal directly with them, but brushing up on these topics helped.

Things to read

Shorewall has very extensive documentation. So much so that it can be a bit daunting, not knowing where to start.
I found the following documents helpful to get me started:

Installing shorewall

Go to the download site list [http://shorewall.net/download.htm#Sites] and download the most appropriate binary package for your distribution.

If you get RPMs for RedHat systems, you only need to install (rpm -ivh) the following packages:

    shorewall-4.X.X-X.noarch.rpm
    shorewall-perl-4.X.X-X.noarch.rpm 

If you install from source, only download, compile and install the common, doc and perl packages.

Preparing the system

For shorewall to properly handle both our firewall and packet routing needs, we need to make sure that the other parts of the system are not interfering with it.

Internet lines

Make sure that your multiple internet lines are properly working on their own!

Disable routing

  • Make sure that you don’t define a GATEWAY in the configuration of your network interfaces (in /etc/sysconfig/network-scripts/ifcfg-XXX) .
  • If you use an (A)DSL connection, also set DEFROUTE=no in its ifcfg-XXX file.
  • Remove the GATEWAY from the /etc/sysconfig/network file if there is one.
  • Edit your /etc/sysctl.conf file and set net.ipv4.conf.default.rp_filter = 0.

Disable firewall

Disable the current firewall, for instance using the system-config-securitylevel helper tool.
Be careful: if you’re directly connected to the Internet, you will be left without protection!
You can actually wait until shorewall is properly configured to disable the firewall.

Shorewall configuration

Shorewall uses a set of simple configuration files, all located under /etc/shorewall/. For exact detail of each configuration files, have a look at the list of man pages.

Zones

The zones file is probably the simplest configuration file.
Details in the zones man page. Here we just name the various zones we want our firewall to handle:

    ################################################################
    #ZONE   TYPE          OPTIONS       IN                  OUT
    #                                   OPTIONS             OPTIONS
    fw      firewall
    net     ipv4
    loc     ipv4
    dmz     ipv4

This just reflects our setup as highlighted in the diagram above.

Note that the fw zone is often referred to as the $FW variable instead in various configuration files.

Interfaces

Here we list all the network interfaces connected to our firewall and for which zone they apply.
Details in the interfaces man page.

    ################################################################
    #ZONE   INTERFACE       BROADCAST       OPTIONS
    net     ppp0            detect
    net     eth1            detect
    dmz     eth2            detect
    loc     eth3            detect

Note that for our net zone, we list the 2 interfaces connected to our ISPs.
If you’re using PPPoE to connect, don’t use the interface name eth0; use ppp0 instead.

Policy

The policy file tells shorewall which default actions should be taken when traffic is moving from one zone to another.
These default actions are taken if no other special action was specified in other configuration files.
View the policy file as a list of default actions for the firewall.
Details about this configuration file as in its man page.

    ################################################################
    #SOURCE DEST    POLICY          LOG     LIMIT:      CONNLIMIT:
    #                               LEVEL   BURST       MASK
    net     net     DROP            info
    loc     net     ACCEPT
    dmz     net     ACCEPT
    loc     dmz     ACCEPT
    loc     $FW     ACCEPT
    dmz     $FW     ACCEPT
    $FW     net     ACCEPT
    dmz     loc     DROP            info
    net     all     DROP            info
    all     all     DROP            info

Traffic from one zone to another needs to be explicitly ACCEPTed, REJECTed or DROPped.
For instance, loc net ACCEPT means that we allow all traffic from our local LAN to the Internet, while net all DROP means we don’t allow incoming traffic from the internet to anyone (remember this is the default action, in most cases we will override this for specific types of traffic in the rules file).
When we set the default action to DROP, we can tell shorewall to keep a trace of the details in the /var/log/messages log.

Providers

The providers file is generally only used in a multi-ISP environment.
Here we define how we want to mark packets originating from one ISP with a unique ID so we can tell the kernel to route these packets to the right interface.
Not doing this would get packets received from one interface to be routed to the default gateway instead.
The details of this configuration file are explained in the providers man page for it.

    #############################################################################
    #NAME NUMBER MARK DUPLICATE INTERFACE GATEWAY      OPTIONS          COPY
    CYBR  1      0x1  main      ppp0      -            track,balance=1  eth2,eth3
    HKBN  2      0x2  main      eth1      30.40.50.61  track,balance=5  eth2,eth3

Note that the DUPLICATE column tells shorewall that it should make a copy of the main default routing table for this particular routing table (called CYBR or HKBN depending on which ISP we refer to).
Packets are marked with number 0x1 or 0x2 so we can distinguish them during their travel through the system.
For PPPoE connections, don’t specify a GATEWAY since it’s most likely that your ISP didn’t give you one.

The most interesting part of this file are the OPTIONS: track means that we want the packets to be tracked as they travel through the system; balance tells the kernel that we want traffic coming out to be spread over both interfaces.
Additionally, we want HKBN to receive more or less 5 times more traffic than CYBR (note that this has no effect on reply packets).

The COPY column will ensure that the routing tables created for CYBR and HKBN are copied for each internal interface, so our eth2 and eth3 interfaces know how to route packets to the right ISP.

Route Rules

For our purpose, the route_rules file only describes how traffic should be routed through one or the other ISP we set up in /etc/shorewall/providers.
Details are in the route_rules file man page.

    #####################################################################
    #SOURCE             DEST               PROVIDER        PRIORITY
    ppp0                -                  CYBR            1000
    eth1                -                  HKBN            1000

Here we simply say that all traffic coming in through ppp0 should be routed using the CYBR table, and all traffic coming in through eth1 using the HKBN table.
The PRIORITY is an ordering number that tells shorewall to consider this routing rule before it marks the packets. Since we know the packets originated from ppp0 or eth1, we don’t really need to mark them.

Masq

The masq file will contain the masquerading rules for our private interfaces: in essence, we want traffic from the local LAN and DMZ to be hidden behind our limited number of external IPs.
See the masq manpage for all the details.

    #####################################################################
    #INTERFACE              SOURCE           ADDRESS                 
    # Ensure that traffic originating on the firewall and redirected via 
    # packet marks always has the source IP address corresponding to the 
    # interface that it is routed out of.
    # See http://shorewall.net/MultiISP.html#Example1
    ppp0                    30.40.50.62      111.99.88.77
    eth1                    111.99.88.77     30.40.50.62
    ppp0                    eth2             111.99.88.77
    eth1                    eth2             30.40.50.62
    ppp0                    eth3             111.99.88.77
    eth1                    eth3             30.40.50.62

The first part ensures that the traffic coming out of our public interfaces but originating from the other is actually rewritten as originating from the right IP for the interface.
This ensures that packets leaving eth1 for instance don’t come out with the wrong source address of the other interface.
The second part ensures that packets from our LAN or DMZ leaving either public interface do so with the right IP address, so traffic from my desktop going through ppp0, for instance, will have its source address rewritten to 111.99.88.77.

Rules

This is the main file where we tell shorewall our basic configuration and how we want packets to be handled in the general case.
The /etc/shorewall/rules file contains the specific instructions on where to direct traffic that will override the default actions defined in the /etc/shorewall/policy file.

    #####################################################################
    #ACTION    SOURCE                DEST                   PROTO  
    #                                                                     
    SECTION NEW
    # Drop and log packets that come from the outside but pretend 
    # to have a local address
    DROP:info  net:192.168.0.0/24    all
    DROP:info  net:192.168.254.0/24  all

    # Redirect incoming traffic to the correct server for WWW and email
    DNAT       all                   dmz:192.168.254.20     tcp   www
    DNAT       all                   dmz:192.168.254.10     tcp   110
    DNAT       all                   dmz:192.168.254.10     tcp   143
    DNAT       all                   dmz:192.168.254.10     tcp   25

In its most basic form, what we’ve just defined here is that we want all traffic from anywhere destined for port 80 (www) to be sent to our win server.
All mail traffic, POP3 (port 110), IMAP (port 143) and SMTP (port 25) is to be redirected to our linux server in the DMZ.

There are a few more useful rules that we can include. For instance, I want to be able to access my servers through either ISP from home (IP 123.45.67.89) and disallow everyone else from accessing them.

    #####################################################################
    #ACTION    SOURCE                DEST                   PROTO  
    #                                                                     
    # Allow SSH to the firewall from the outside only from home

    ACCEPT     net:123.45.67.89      $FW                    tcp   ssh
    # Redirect input traffic to the correct server for RDP, VNC and SSH 
    DNAT       net:123.45.67.89      dmz:192.168.254.10:22  tcp   2222
    DNAT       net:123.45.67.89      dmz:192.168.254.10     tcp   5901
    DNAT       net:123.45.67.89      dmz:192.168.254.20     tcp   3389

When I SSH to 30.40.50.62 or 111.99.88.77 on the normal port 22, I will access the firewall.
Now if I SSH to the non-standard port 2222, I will instead access the linux server.
Port 5901 is for remoting into the linux machine through VNC, and port 3389 will be used for Remote Desktop connections to the win server.

To make sure my machines are up and running, I like to be able to ping them:

    #####################################################################
    #ACTION    SOURCE              DEST              PROTO  
    #                                                                     
    # Accept pings between zones
    ACCEPT     dmz                 loc               icmp  echo-request
    ACCEPT     loc                 dmz               icmp  echo-request

Note that ping will only work between the LAN and the DMZ and pinging my firewall from the Internet will result in the requests being silently dropped.
I usually prefer that configuration as it makes discovering the servers by random bots slightly less likely.

There are lots of other cool things we can do with forwarding but that will do for now.

shorewall.conf

The last file we’re going to look at is the main configuration file for shorewall.
See details about each option from the man page for shorewall.conf.

Most options are OK by default. The only ones that I have had to change are:

    STARTUP_ENABLED=Yes
    MARK_IN_FORWARD_CHAIN=Yes
    FASTACCEPT=Yes
    OPTIMIZE=1

The first option tells shorewall that we want it to start automatically when the system boots.
That’s not enough though, so make sure that the service will be started:

    # chkconfig --level 235 shorewall on

Installing our firewall rules

Shorewall configuration files need to be compiled without error before the firewall is actually loaded by shorewall.
The command:

    # shorewall restart

will stop and recompile the current configuration.
If there are any errors, the current firewall rules will be unchanged.
There are lots of other commands that can be issued. Check the man page for a complete list.

If you use PPPoE, you will want the firewall to be restarted every time the line reconnects.
The simplest way is to create a file /etc/ppp/ip-up.local containing a single line:

    shorewall restart

DNS

There is one remaining issue with our firewall: if a user on the LAN attempts to access the web server by its name the request will probably fail.
Same for accessing our mail server: we can configure our desktop to connect to 192.168.254.10 to get and send emails, but on the laptop we would usually use something like pop.acme.com instead so we can read our emails from outside the office.

Similarly, trying to access www.acme.com hosted on the win server from the linux server will fail.

One solution is to route traffic through the firewall but that’s actually fairly complicated to set up properly.
The shorewall FAQ 2 discourages this and instead recommends the use of split-DNS: it’s very easy to set up and it works like a charm.

dnsmasq

Just install dnsmasq on the firewall. There are ready-made packages available for it and a simple yum install dnsmasq should suffice.

Dnsmasq provides a simple DNS forwarding and DHCP service. I had already configured dhcpd -which is fairly simple to configure- on my firewall so I won’t need DHCP from dnsmasq, but you can easily set it up if you want.

On the DNS side, dnsmasq can be told to first try to resolve hostnames by looking at the standard /etc/hosts file and then query the DNS servers defined in /etc/resolv.conf if necessary.

This simple trick means that we can:

  • Keep our normal DNS service pointing to, say, 111.99.88.77 for www.acme.com so that people on the Internet will properly resolve their web requests to our win server.
  • Add an entry in the firewall’s hosts file to point local clients to 192.168.254.20 instead.

To achieve this, simply edit /etc/hosts and add entries matching all your services:

    # Acme's services. 
    # One line for each DNS entries accessible from the Internet
    192.168.254.20        acme.com
    192.168.254.20        www.acme.com
    192.168.254.10        pop.acme.com
    192.168.254.10        mail.acme.com

dnsmasq configuration

Edit /etc/dnsmasq.conf and uncomment or add the following lines:

    # Never forward plain names (without a dot or domain part)
    domain-needed
    # Never forward addresses in the non-routed address spaces.
    bogus-priv
    # listen on DMZ and LAN interfaces
    interface=eth2
    interface=eth3
    # don't want dnsmasq to provide dhcp
    no-dhcp-interface=eth2
    no-dhcp-interface=eth3

Then make sure that dnsmasq will start on boot:

    # chkconfig --level 235 dnsmasq on
    # service dnsmasq restart

DNS resolution

There may be one last issue with DNS: in your /etc/resolv.conf you will have listed the DNS servers of one or both of your ISPs.
The problem is that some ISPs don’t allow access to their name servers from a network different than theirs.

The result is that each time any of the systems issues a DNS request it may fail and need to be sent to the next server instead, which may also fail and introduce delays in accessing named resources on the Internet.

One easy way out is to not use the ISPs’ DNS servers but instead only list the free OpenDNS name servers in your resolv.conf:

    search acme.com
    nameserver 208.67.222.222
    nameserver 208.67.220.220

Then make sure that you disable DNS in the /etc/sysconfig/network-scripts/ifcfg-XXX configuration file for your PPPoE connection:

    PEERDNS=no

Failure to do so will result in your /etc/resolv.conf file being rewritten with the DNS servers of one of your ISPs every time you reconnect to them.

DHCP configuration

If you use dhcpd for local users, then you will need to make sure that its DNS server is set to the firewall’s:

    # DHCP Server Configuration file.
    ddns-update-style none;
    ignore client-updates;

    subnet 192.168.0.0 netmask 255.255.255.0 {
        option routers                  192.168.0.1;
        option subnet-mask              255.255.255.0;
        option domain-name              "acme.com";
        option domain-name-servers      192.168.0.1;
        range 192.168.0.200 192.168.0.250;
        default-lease-time 86400;
        max-lease-time 132000;
    }

On your local machines that use DHCP, make sure to renew your IP.
All other machines should be configured to use 192.168.0.1 as their unique DNS server and the machines in the DMZ should have their DNS set to 192.168.254.1.

Unless you reboot, don’t forget to flush the local DNS cache on each machine:
On Windows, from the command line:

    C:\> ipconfig /flushdns

On Mac, from the terminal:

    bash-x.xxx$ dscacheutil -flushcache

Initial conclusions

I believe this type of firewall setup is fairly common and I hope that the -rather long- article helped you get your own setup in place.
In the -much shorter- follow-up articles, we’ll make our system as redundant as possible so our web and email services stay online even when one of the broadband connections fails.

In the meantime, don’t hesitate to leave your comments and corrections below.

20 comments   |   Filed under :  Linux,Security,sysadmin,Web

Monday, January 19, 2009

Sysadmin: file and folder synchronisation

Technology Over the years I’ve struggled to keep my folder data synchronised between my various desktops and laptops.

Here I present the tools I’ve tried and what I’ve finally settled on as possibly the ultimate answer to the problem of synchronising files and folders across multiple computers:

Sync Files

rsync

I’ve tried rsync, which is a great Open Source tool to securely synchronise data either one-way or two-way.
It’s very efficient with bandwidth as it only transfers the blocks of data that have actually changed in a file instead of the whole file. It can tunnel traffic across SSH and I’ve got a few cron jobs set up between various servers to back up files daily.

Its only weaknesses are that:

  • Every time it runs, it needs to inspect all files on both sides to determine the changes, which is quite an expensive operation.
  • Setting up synchronisation between multiple copies of the data can be tricky: you need to sync your computers in pairs multiple times, which quickly becomes expensive and risky if you have the same copy across multiple computers.
  • It doesn’t necessarily detect that files are in use at the time of the sync, which could corrupt them.

unison

Unison is a folder synchronisation tool whose specific purpose is to address some of the shortcomings of rsync when synchronising folders between computers. It’s also a cross-platform Open Source tool that works on Linux, OS X, Windows, etc.

Unison uses the efficient file transfer capabilities of rsync but it is better at detecting conflicts and it will give you a chance to decide which copy you want when a conflict is detected.

The issue though is that, like rsync, it needs to inspect all files to detect changes which prevents it from detecting and propagating updates as they happen.

The biggest issue with these synchronisation tools is that they tend to increase the risk of conflict because changes are only detected infrequently.

WinSCP

WinSCP is an Open Source Windows GUI FTP utility that also allows you to synchronise folders between a local copy and a remote one on the FTP server.

It has conflict resolution and allows you to decide which copy to keep.

It’s great for what it does and allows you to keep a repository of your data in sync with your local copies but, here again, WinSCP needs to go through each file to detect the differences and you need to sync each computer manually against the server, which is cumbersome and time-consuming.

General Backup tools

There are a lot more tools that fall into the category of backup utilities: they all keep a copy of your current data in an archive, on a separate disk or online. Some are great in that they allow you to access that data on the web (I use the excellent JungleDisk myself) but file synchronisation is not their purpose.

Now for some Captain Obvious recommendation: remember that file synchronisation is not a backup plan: you must have a separate process to keep read-only copies of your important data.
File synchronisation will update and delete files you modify across all your machines, clearly not what you want if you need to be able to recover them!

Revision Control Systems

Revision control systems like cvs, subversion, git, etc. are generally used to keep track of changes to source code files; however, they have also been used successfully to keep multiple copies of the same data in sync.
It’s actually exactly what I use for all my source code and associated files: I have a subversion server and I check-out copies of my software project folders on various computers.

After making changes on one computer, I commit the changes back to the server and update these changes on all other computers manually.

While great at keeping track of each version of your files and ideally suited to pure text documents like source code, revision control systems have drawbacks that make them cumbersome for general data synchronisation:

  • you need to manually commit and update your local copies against the server.
  • not all of them are well suited to deal with binary files
  • when they work with binary files, they just copy the whole file when it changes, which is wasteful and inefficient.

Revision Control System are great for synchronising source code and configuration files but using them beyond that is rather cumbersome.

Complex setup

All of the above solutions also have a major drawback: getting them to work across the Internet requires complex setup involving firewall configurations, security logins, exchange of public encryption keys in some cases, etc.

All these are workable but don’t make for a friendly, peace-of-mind setup.

What we want from data synchronisation

I don’t know about you but what I’m looking for in a synchronisation tool is pretty straightforward:

  • Being able to point to a folder on one computer and make it synchronise across one or multiple computers.
  • Detect and update the changed files transparently in the background without my intervention, as the changes happen.
  • Be smart about conflict detection and only ask me to make a decision if the case isn’t obvious to resolve.

Live Mesh folders

Enter Microsoft Live Mesh Folders, now in beta and available to the public. Live Mesh is meant to be Microsoft’s answer to synchronising information (note, I’m not saying data here) across computers, devices and the Internet.
While Live Mesh wants to be something a lot bigger than just synchronising folders, let’s just concentrate on that aspect of it.

Installing Live Mesh is pretty easy: you will need a Windows Live account to log-in but once this is done, it’s a small download and a short installation.

Once you’ve added your computer to your “Mesh” and are logged in you are ready to use Live Mesh:

  • You decide how the data is synchronised for each computer participating in your Mesh:
    you’re in charge of what gets copied where, so it’s easy to make large folders pair between say your laptop and work desktop and not your online Live Desktop (which has a 5GB limit) or your computer at home. You’re in control.
  • Files are automatically synchronised as they change across all computers that share the particular folder you’re working in.
    If the file is currently used, it won’t be synced before it is closed.
  • If the other computers are not available, the sync will automatically happen as they are up again.
  • There is no firewall setup: each computer knows how to contact the others automatically and uses the appropriate network: transfers are local if the computers are on the same LAN, or go across the Internet otherwise.
    All that without user intervention at all.
  • Whenever possible, data is exchanged in a P2P fashion where each device gets data from all the other devices it can see, making transfers quite efficient.
  • File transfers are encrypted so they should be pretty safe even when using unsafe public connections.
  • If you don’t want to allow sync, say you’re on a low-bandwidth dialup, you can work offline.
  • The Mesh Operating Environment (MOE) is pretty efficient at detecting changes to files. Unlike other systems, in most cases it doesn’t need to scan all files to find out which ones have been updated or deleted.

Some drawbacks

  • It’s not a final product, so there are some quirks and not all expected functionalities are there yet.
  • The Mesh Operating Environment (MOE) services can be pretty resource hungry, although, in fairness, it’s not too bad except that it slows down your computer’s responsiveness while it loads at boot time.
  • You can’t define patterns of files to exclude in your folder hierarchy.
    That can be a bit annoying if the software you use is often creating large backup files automatically (like CorelDraw does) or if there are sub folders you don’t need to take everywhere.
  • The initial sync process can take a long time if you have lots of files.
    A solution if you have large folders to sync is to copy them first manually on each computer and then force Live Mesh to use these specific folders: the folders will be merged together and the initial sync process will be a lot faster as very little data needs to be exchanged between computers.

Bear in mind that Live Mesh is currently early beta and that most of these drawback will surely be addressed in the next months.

Conclusion

I currently have more than 18GB representing about 20,000 files synchronised between 3 computers (work desktop, laptop and home desktop) using Live Mesh.

While not 100% there, Live Mesh Folder synchronisation is really close to the real thing: it’s transparent, efficient, easy to use and it just works as you would expect.

Now that Microsoft has released the Sync Framework to developers, I’m sure that other products will come on the market to further enhance data synchronisation in a more capable way.
In the meantime, Live Mesh has answered my needs so far.

3 comments   |   Filed under :  Software,sync,sysadmin

Thursday, January 8, 2009

Sysadmin: Recovering deleted Windows partitions

Technology I made a mistake the other day: I wanted to delete the partition on an external drive and in my haste ended up deleting the partition of a local hard drive instead…

The good thing is that when you delete a partition using the Windows Disk Management console, it doesn’t actually delete your files, only the partition header.

Windows Disk Management Console

With NTFS file systems, there is a backup of the boot sector at the end of the partition. The problem is: how do you recover it?

I first looked at the instructions from Microsoft knowledge base article kb245725 and downloaded the low-level sector editor Dskprobe, but was getting nowhere with it.

Searching google brings you to the usual list of recovery software that you can’t be sure will actually do the job until you fork $$ for them.
I’ve got nothing against paying for software but I’ve been bitten by false promises before.

My search ended with TestDisk, an open-source utility for manipulating and recovering partitions that works on almost all platforms.
The user interface is DOS-style, so it’s not pretty or point-and-click friendly, but it has a fair number of options and, after fiddling around with it for 10 minutes, I was able to simply recover the backup boot sector and, tada!, all my files were back.

TestDisk in action

So, some recommendations when recovering lost partitions:

  • Don’t panic! If you only deleted the partition (whatever its type), chances are you can recover it, or at least salvage the files.
  • Obviously, be careful not to write anything over the lost data, for instance by recreating partitions or formatting a new file system.
  • If you use a utility like TestDisk, don’t blindly follow the on-screen instructions. At first, it was telling me that I had 2 Linux partitions on the device (which used to be true) but it did not see the NTFS one. Then it thought I had a FAT partition only until I switched to the advanced options and inspected the boot partition.
    Just know enough about file systems to know what you’re looking for.
  • Low-level tools are not for everyone, so if you’re not comfortable using them, don’t push your luck: try a paid-for recovery tool with an easier interface instead.

If you use TestDisk and you manage to recover your files, don’t forget to donate to encourage Christophe GRENIER, the author.

4 comments   |   Filed under :  sysadmin

Thursday, November 6, 2008

.Net: Working with OpenOffice 3

(Updated Wednesday 30JAN2009.) Developing applications that manipulate OpenOffice documents has always been rather tricky; not very difficult, but just tricky to get right.

With OpenOffice 3, things got trickier and applications that used to work will now fail. I’ve just spent nearly a day trying to get a mail-merge application I built to work again with the new version of OO.
OOo3

Changes and limitations

Developing .Net applications with OpenOffice 3 now requires that the .Net framework 3.5 be installed.
Only one version of OpenOffice/StarOffice may be installed on the machine.

Getting the CLI assemblies in your project

OpenOffice used to ship with a bunch of CLI DLLs for manipulating Open Documents from .Net. With OpenOffice version 3, these DLLs are directly installed in the GAC and not available in the Program installation folder any longer, making them unavailable when you need to add them as references to your Visual Studio project.

The trick is to extract the DLLs from the installation CAB, reference those in your project, and make sure you exclude them from your setup file so that only the version installed in the GAC of the user’s computer gets used.

  1. Open the installation folder where the installation files are extracted (usually left on your desktop under a folder like OpenOffice.org 3.0 (en-US) Installation Files during the OO installation process).
  2. Open the openofficeorg1.cab file (using a utility like 7zip if necessary) and extract the files matching cli_*.dll.
  3. Add the ones you need to your VS Project’s references and make sure the properties of these references have their Copy Local and Specific Version properties set to false.
  4. If you have a setup project that added these references to the list of files, right-click each of them and select Exclude to make sure they won’t be packaged.

The reason for not deploying the DLLs is that they are very specific to a given OpenOffice version: the assemblies relevant to the user’s OpenOffice installation will already be deployed to the target machine’s GAC.
When .Net loads assemblies that are not found in the folder of the application being launched, it falls back to the GAC if they can be found there.
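If you want to double-check at runtime that the CLI assemblies really were resolved from the GAC rather than from a stray local copy, a small diagnostic like the one below can help. This is just a sketch of my own: drop it into any method and run it after your first call into OpenOffice, once the assemblies have been loaded; the cli_ name filter is an assumption based on the standard assembly naming.

// Diagnostic sketch: list the OpenOffice CLI assemblies loaded by the process
// and show whether each one was resolved from the GAC and from which location.
foreach (System.Reflection.Assembly asm in AppDomain.CurrentDomain.GetAssemblies())
{
    string name = asm.GetName().Name;
    if (name.StartsWith("cli_"))
        Console.WriteLine("{0}  GAC={1}  {2}", name, asm.GlobalAssemblyCache, asm.Location);
}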

Path and environment setup

Before being able to call OpenOffice from your application you now must set your application’s environment variables to the correct paths so the assemblies can find the actual OpenOffice library and program files.

You need to append to PATH the folder where the UNO runtime (URE) libraries reside, and add a UNO_PATH environment variable that points to the program folder of OpenOffice. In short, before any call to OpenOffice functions you must:

  • Append to PATH something like C:\Program Files\OpenOffice.org 3\URE\bin.
  • Create a UNO_PATH variable set to C:\Program Files\OpenOffice.org 3\program.

Because there is no guarantee that these paths will not change, or that they are valid for all systems, you must get them from specific keys in the Registry:

  • PATH is appended with the value of HKLM\SOFTWARE\OpenOffice.org\Layers\URE\1\UREINSTALLLOCATION, to which you must append the bin directory.
  • UNO_PATH is set to the content of the HKLM\SOFTWARE\OpenOffice.org\UNO\InstallPath key.

See the C# and VB.Net code below for working examples.

Special Considerations for x64 systems

My development machine runs Windows Server 2008 x64 and I’ve run into some specific issues that you’re likely to encounter when deploying to a 64-bit OS.

OpenOffice is 32-bit only
That means that your .Net project must be set to target x86 platforms only:
Open your Solution’s Configuration Manager and, under Active solution platform, click New…, then:

Configuration

Make sure you repeat this for both the Debug and Release configurations.
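A quick way to confirm which way the process actually ended up running (easy to get wrong when juggling configurations) is to check the pointer size at runtime, the same trick the code further down uses for the registry:

// Quick check sketch: IntPtr is 4 bytes in a 32-bit (x86) process, 8 bytes in a 64-bit one.
Console.WriteLine("Running as a {0}-bit process", IntPtr.Size == 8 ? 64 : 32);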

Registry keys are elsewhere
On a 64-bit OS, 32-bit applications see the registry keys normally expected under HKEY_LOCAL_MACHINE\Software redirected to HKEY_LOCAL_MACHINE\Software\Wow6432Node instead.
This of course creates issues when you’re trying to read a registry key that’s not where it should be…

The Code

The code below will allow you to correctly connect to OpenOffice 3 under 32 or 64-bit systems.
It reads the registry to find the proper paths, appends the URE bin directory to PATH and creates the UNO_PATH environment variable expected by the bootstrapper to find the OpenOffice program and libraries.

The code is built upon information and a test program made available by Marten Feldtmann on his blog (more information, in English, is available on OOoForum ).

Please let me know if this works for you or if you have any corrections.

// Requires: using Microsoft.Win32; using System.IO; using System.Runtime.InteropServices;
private void InitOpenOfficeEnvironment() {
  string baseKey;
  // OpenOffice being a 32 bit app, its registry location is different in a 64 bit OS
  if (Marshal.SizeOf(typeof(IntPtr)) == 8)
      baseKey = @"SOFTWARE\Wow6432Node\OpenOffice.org\";    
  else
      baseKey = @"SOFTWARE\OpenOffice.org\";  

  // Get the URE directory
  string key = baseKey + @"Layers\URE\1";
  RegistryKey reg = Registry.CurrentUser.OpenSubKey(key);
  if (reg==null) reg = Registry.LocalMachine.OpenSubKey(key);
  string urePath = reg.GetValue("UREINSTALLLOCATION") as string;
  reg.Close();
  urePath = Path.Combine(urePath, "bin");

  // Get the UNO Path
  key = baseKey + @"UNO\InstallPath";
  reg = Registry.CurrentUser.OpenSubKey(key);
  if (reg==null) reg = Registry.LocalMachine.OpenSubKey(key);
  string unoPath = reg.GetValue(null) as string;
  reg.Close();

  string path;
  path = string.Format ("{0};{1}", System.Environment.GetEnvironmentVariable("PATH"), urePath);
  System.Environment.SetEnvironmentVariable("PATH", path);
  System.Environment.SetEnvironmentVariable("UNO_PATH", unoPath);
}

In VB.Net:

' Requires: Imports Microsoft.Win32, System.IO, System.Runtime.InteropServices
Private Sub InitOpenOfficeEnvironment()
  Dim baseKey As String
  ' OpenOffice being a 32 bit app, its registry location is different in a 64 bit OS
  If (Marshal.SizeOf(GetType(IntPtr)) = 8) Then
    baseKey = "SOFTWARE\Wow6432Node\OpenOffice.org\"
  Else
    baseKey = "SOFTWARE\OpenOffice.org\"
  End If

  ' Get the URE directory
  Dim key As String = (baseKey + "Layers\URE\1")
  Dim reg As RegistryKey = Microsoft.Win32.Registry.CurrentUser.OpenSubKey(key)
  If (reg Is Nothing) Then
    reg = Registry.LocalMachine.OpenSubKey(key)
  End If
  Dim urePath As String = CType(reg.GetValue("UREINSTALLLOCATION"),String)
  reg.Close
  urePath = Path.Combine(urePath, "bin")

  ' Get the UNO Path
  key = (baseKey + "UNO\InstallPath")
  reg = Microsoft.Win32.Registry.CurrentUser.OpenSubKey(key)
  If (reg Is Nothing) Then
    reg = Microsoft.Win32.Registry.LocalMachine.OpenSubKey(key)
  End If
  Dim unoPath As String = CType(reg.GetValue(Nothing),String)
  reg.Close

  Dim path As String
  path = String.Format ("{0};{1}",System.Environment.GetEnvironmentVariable("PATH"),urePath)
  System.Environment.SetEnvironmentVariable("PATH", path)
  System.Environment.SetEnvironmentVariable("UNO_PATH", unoPath)
End Sub
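For reference, once the environment is set up, connecting to OpenOffice from C# typically looks something like the sketch below. It assumes the standard CLI-UNO bootstrap API from the cli_* assemblies referenced earlier; adapt it to whatever your application actually does with the documents.

// Usage sketch (assumes the cli_* assemblies are referenced as described above).
InitOpenOfficeEnvironment();

// Start (or connect to) soffice and get the UNO component context.
unoidl.com.sun.star.uno.XComponentContext context = uno.util.Bootstrap.bootstrap();
var factory = (unoidl.com.sun.star.lang.XMultiServiceFactory)context.getServiceManager();

// From here, services such as com.sun.star.frame.Desktop can be created
// to load and manipulate documents.
object desktop = factory.createInstance("com.sun.star.frame.Desktop");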

Updates

07MAY2009 – Added reference link to OOo documentation.
30JAN2009 – Added reference to Aleksandr Sazonov’s article on CodeProject (thanks for the article).
03DEC2008 – Added VB.Net translation. Thanks to Stefan for suggesting it.

32 comments   |   Filed under :  .Net,OpenOffice,Programming

Tuesday, October 28, 2008

Windows 2008 / Windows 7 x64: The ‘Microsoft.Jet.OLEDB.4.0’ provider is not registered on the local machine.

There are times when the coexistence of 64 and 32-bit code on the same machine can cause all sorts of seemingly strange issues.
One of them just occurred to me while trying to run the ASPx demos from Developer Express, my main provider of .Net components (the best supplier I’ve ever been able to find).
I was getting the following error:

The ‘Microsoft.Jet.OLEDB.4.0’ provider is not registered on the local machine:

Server Error

It may look otherwise, but this error is generally due to either of two things:

  • you don’t have Office 2007/2010 Jet drivers installed
  • or you are running a 32 bit application in a default x64 environment.

The first issue is easy to solve: just download the Access 2010 Database Engine from Microsoft (it works with Access 2007 databases as well).

For the second one, the fix is also easy enough:

  • For Windows 2008: Navigate to Server Manager > Roles > Web Server (IIS) > Internet Information Services (IIS) Manager, then look under your machine name > Application Pool.
  • For Windows 7: Navigate to Programs > Administrative Tools > Internet Information Services (IIS) Manager, then look under your machine name > Application Pool.

Under there, open the DefaultAppPool’s Advanced Settings and change Enable 32-Bit Applications to True:

Advanced Settings

You may have to restart the service for it to take effect but it should work.
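If you want to check from code whether the process you are running can actually see the Jet provider, a small console sketch like the one below makes the bitness issue obvious. The connection string and .mdb path are hypothetical, purely for illustration.

// Diagnostic sketch: the Jet OLE DB provider is 32-bit only, so a 64-bit
// process will always report it as "not registered", whatever is installed.
using System;
using System.Data.OleDb;

class JetProviderCheck
{
    static void Main()
    {
        Console.WriteLine("Running as a {0}-bit process", IntPtr.Size == 8 ? 64 : 32);

        try
        {
            // Hypothetical database path, for illustration only.
            using (var cnx = new OleDbConnection(
                @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=C:\Data\Test.mdb;"))
            {
                cnx.Open();
                Console.WriteLine("Jet provider found and database opened.");
            }
        }
        catch (Exception ex)
        {
            // In a 64-bit process this is typically the
            // "provider is not registered" error.
            Console.WriteLine(ex.Message);
        }
    }
}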

Updates

  • 10DEC2011: Updated driver link to use the Access 2010 engine.
  • 03APR2010: Added instructions for Windows 7.
  • 12FEB2009: Added reference to Scott’s article.
  • 28OCT2008: Original version.

100 comments   |   Filed under :  .Net,Database,Programming,sysadmin,Web
