Sunday, September 16, 2012

OCR with MODI.dll (Microsoft Office Document Imaging)

Before starting with MODI.dll, I should explain what OCR is. OCR (Optical Character Recognition) is the process of converting images of text into machine-readable text. For example, a PDF file whose pages are screenshots of plain text can be converted into a Microsoft Word file; this requires OCR, and it can be done in many ways.

There are many OCR solutions that work in similar ways. Some are multilingual, and others are better at recognizing only a particular set of fonts. Some are open source, and others require a license.

If you need an open source library, I suggest starting with Google's Tesseract. You can find several C# wrappers such as TessNet2, but it currently uses Tesseract 2, so I suggest waiting for a Tesseract 3 wrapper.

I dug into Tesseract, and at one point I was training my own font to improve the results. Training is a bit tricky, though, so I decided to try other options; I will write about training later. Then I found MODI, which does almost perfect OCR, except that it requires a Microsoft Office license (I guess).

MODI is an OCR DLL that was built into Microsoft Office 2003 and 2007 but is not directly included in Office 2010. In Office 2010 the OCR functionality moved into OneNote, but that is a different world.

If you have Office 2003 or 2007, you just have to include Microsoft Office Document Imaging in your Office configuration (Software -> Edit Microsoft Office Installation -> Add/Remove Components -> Office Tools -> Microsoft Office Document Imaging has to be installed).

If you do not have Office 2003 or 2007, like me, you can get it from the Microsoft website. The simplest way is to download this, and when you launch the setup, choose custom install, disable all components, and under Office Tools enable only Microsoft Office Document Imaging. Then complete the install, and you should be able to add MODI as a reference in your Visual Studio solution. Other than that, in the Start menu you can find the Microsoft Office Document Imaging application under Microsoft Office Tools and try out how good its OCR is.
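Before adding the reference, it may help to confirm that the MODI COM component is actually registered on the machine. The following is a minimal sketch, assuming the component registers the ProgID "MODI.Document" (the name commonly used to create it from script); the class and method names are made up for illustration.

```csharp
using System;

static class ModiCheck
{
    // Assumption: the MODI installer registers the ProgID "MODI.Document".
    public static bool IsModiInstalled()
    {
        // GetTypeFromProgID returns null when the ProgID is not registered.
        return Type.GetTypeFromProgID("MODI.Document") != null;
    }

    static void Main()
    {
        Console.WriteLine(IsModiInstalled()
            ? "MODI appears to be installed."
            : "MODI was not found; install Microsoft Office Document Imaging first.");
    }
}
```

This only checks COM registration on Windows; it does not verify that the OCR engine itself works.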

At this point, you have all you need to start doing OCR with MODI. For best results, I suggest saving your images in TIF format and then running OCR with MODI. Let me share my sample code; you can then develop your own methods.

I usually save the image to disk first and then run OCR on the file. You can download the image with a web request as follows.


            // Requires: using System.IO; using System.Net;
            byte[] imageBytes;

            HttpWebRequest imageRequest = (HttpWebRequest)WebRequest.Create(imageUrl);
            using (WebResponse imageResponse = imageRequest.GetResponse())
            using (Stream responseStream = imageResponse.GetResponseStream())
            using (MemoryStream buffer = new MemoryStream())
            {
                // Copy the whole response instead of capping it at a fixed byte count.
                responseStream.CopyTo(buffer);
                imageBytes = buffer.ToArray();
            }

            // Write the downloaded bytes to disk.
            File.WriteAllBytes(saveLocation, imageBytes);
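Since the later steps depend on the image format, it can be useful to check what the downloaded bytes actually contain rather than trusting the URL's extension. Here is a small sketch that looks at the standard PNG and TIFF file signatures; the class and method names are hypothetical and not part of any library.

```csharp
using System;

static class ImageFormatSniffer
{
    // Returns "png", "tiff", or "unknown" based on the leading signature bytes.
    // Buffers shorter than 8 bytes are reported as "unknown".
    public static string Detect(byte[] data)
    {
        if (data == null || data.Length < 8) return "unknown";

        // PNG signature: 89 50 4E 47 0D 0A 1A 0A
        byte[] png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
        bool isPng = true;
        for (int i = 0; i < png.Length; i++)
        {
            if (data[i] != png[i]) { isPng = false; break; }
        }
        if (isPng) return "png";

        // TIFF signature: "II*\0" (little-endian) or "MM\0*" (big-endian)
        bool littleEndianTiff = data[0] == 0x49 && data[1] == 0x49 && data[2] == 0x2A && data[3] == 0x00;
        bool bigEndianTiff = data[0] == 0x4D && data[1] == 0x4D && data[2] == 0x00 && data[3] == 0x2A;
        return (littleEndianTiff || bigEndianTiff) ? "tiff" : "unknown";
    }
}
```

For example, calling ImageFormatSniffer.Detect(imageBytes) on the downloaded bytes tells you whether the file is already a TIFF or still needs to be converted.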


You can save the downloaded image directly as TIF or PNG, but most OCR libraries require an uncompressed TIF, so we have to process the saved image with the following code.

First we load the saved image into a Bitmap. Then you should apply whatever manipulations are necessary. For example, I needed to scale the image up by a factor of 4, and I chose the interpolation mode, smoothing mode, and compositing quality to get better image quality for OCR.

Then you pass the TIFF codec as the encoder info, and in the encoder parameters you set the save flag to LastFrame and the compression to CompressionNone. A single TIFF file can actually hold several images, so if you set the save flag to MultiFrame instead of LastFrame, you can add more images to it. I should write about that another day as well.


                    // Requires: using System.Drawing; using System.Drawing.Drawing2D;
                    //           using System.Drawing.Imaging; using System.Linq;
                    using (Bitmap bmp = new Bitmap("savedImage.png"))
                    using (Bitmap dst = new Bitmap(bmp.Width * 4, bmp.Height * 4))
                    {
                        // Upscale the source image by a factor of 4 with high-quality settings.
                        using (Graphics g = Graphics.FromImage(dst))
                        {
                            Rectangle srcRect = new Rectangle(0, 0, bmp.Width, bmp.Height);
                            Rectangle dstRect = new Rectangle(0, 0, dst.Width, dst.Height);

                            g.InterpolationMode = InterpolationMode.HighQualityBilinear;
                            g.SmoothingMode = SmoothingMode.AntiAlias;
                            g.CompositingQuality = CompositingQuality.GammaCorrected;

                            g.DrawImage(bmp, dstRect, srcRect, GraphicsUnit.Pixel);
                        }

                        ImageCodecInfo encoderInfo = ImageCodecInfo.GetImageEncoders().First(i => i.MimeType == "image/tiff");

                        // Save as an uncompressed, single-frame TIFF.
                        EncoderParameters encoderParams = new EncoderParameters(2);
                        encoderParams.Param[0] = new EncoderParameter(System.Drawing.Imaging.Encoder.Compression, (long)EncoderValue.CompressionNone);
                        encoderParams.Param[1] = new EncoderParameter(System.Drawing.Imaging.Encoder.SaveFlag, (long)EncoderValue.LastFrame);

                        // Use the same file name that the OCR step opens.
                        dst.Save("savedImage.tif", encoderInfo, encoderParams);
                    }
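The multi-frame option mentioned above can be sketched roughly as follows, using the standard System.Drawing pattern of saving the first page with MultiFrame, appending further pages with FrameDimensionPage, and closing the file with Flush. The class name, method name, and parameters are hypothetical, and the code is untested against MODI itself.

```csharp
using System.Drawing;
using System.Drawing.Imaging;
using System.Linq;

static class TiffPacker
{
    // Hypothetical helper: combines several image files into one uncompressed multi-page TIFF.
    public static void SaveMultiPageTiff(string[] inputPaths, string outputPath)
    {
        ImageCodecInfo tiffCodec = ImageCodecInfo.GetImageEncoders()
            .First(c => c.MimeType == "image/tiff");

        using (Image first = Image.FromFile(inputPaths[0]))
        {
            // First page: start a multi-frame file with no compression.
            EncoderParameters pageParams = new EncoderParameters(2);
            pageParams.Param[0] = new EncoderParameter(Encoder.Compression, (long)EncoderValue.CompressionNone);
            pageParams.Param[1] = new EncoderParameter(Encoder.SaveFlag, (long)EncoderValue.MultiFrame);
            first.Save(outputPath, tiffCodec, pageParams);

            // Remaining pages: append each image as a new page.
            pageParams.Param[1] = new EncoderParameter(Encoder.SaveFlag, (long)EncoderValue.FrameDimensionPage);
            for (int i = 1; i < inputPaths.Length; i++)
            {
                using (Image next = Image.FromFile(inputPaths[i]))
                {
                    first.SaveAdd(next, pageParams);
                }
            }

            // Finish: flush and close the multi-frame file.
            EncoderParameters flushParams = new EncoderParameters(1);
            flushParams.Param[0] = new EncoderParameter(Encoder.SaveFlag, (long)EncoderValue.Flush);
            first.SaveAdd(flushParams);
        }
    }
}
```

A call such as TiffPacker.SaveMultiPageTiff(new[] { "page1.png", "page2.png" }, "pages.tif") would produce a two-page TIFF, with the file names here being placeholders.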


At the end we save our image file with the .tif extension. Here comes the best part, where we do the OCR. You initialize MODI, point it at the TIF file's path, and you are good to go.


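                    // Requires a reference to the MODI COM library (Microsoft Office Document Imaging).
                    MODI.Document doc = new MODI.Document();
                    doc.Create("savedImage.tif");
                    // Arguments: language, auto-orient the image, straighten (deskew) the image.
                    doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
                    MODI.Image img = (MODI.Image)doc.Images[0];
                    MODI.Layout layout = img.Layout;
                    string text = layout.Text;
                    // Close the document without saving changes.
                    doc.Close(false);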


Now the text variable contains the text recognized in the image, and you can use it as you wish.
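If the TIF holds more than one page, or if OCR might fail on some images, a slightly more defensive version of the same steps may be useful. This is only a sketch: the class and method names are hypothetical, and it assumes that MODI reports OCR failures (reportedly including images with no recognizable text) as COM exceptions.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class ModiOcr
{
    // Hypothetical helper: runs OCR on every page of a TIFF and returns the combined text.
    public static string ReadAllPages(string tifPath)
    {
        MODI.Document doc = new MODI.Document();
        StringBuilder text = new StringBuilder();
        bool opened = false;
        try
        {
            doc.Create(tifPath);
            opened = true;
            doc.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);

            // Collect the recognized text page by page.
            for (int i = 0; i < doc.Images.Count; i++)
            {
                MODI.Image page = (MODI.Image)doc.Images[i];
                text.AppendLine(page.Layout.Text);
            }
        }
        catch (COMException ex)
        {
            // Assumption: OCR failures surface as COM errors; report and return what we have.
            Console.Error.WriteLine("OCR failed: " + ex.Message);
        }
        finally
        {
            // Close without saving changes, then release the COM object.
            if (opened) doc.Close(false);
            Marshal.ReleaseComObject(doc);
        }
        return text.ToString();
    }
}
```

Calling ModiOcr.ReadAllPages("savedImage.tif") would then return the text of all pages, or an empty string if OCR failed.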

Have a nice day.

Mehmet.



4 comments:

  1. Nice article! Do you have some sample code?

  2. Thank you for the MODI sample, it's very nice.

    I'm also looking for "How to train my own font with Tesseract using C#".

  3. I'm quite interested in OCR technology; I've done some research on it and read a lot of guides. To use Tesseract OCR from C#, you may need a .NET OCR library.

  4. MODI is no longer included as of Office 2010.