Search in File contents (PDF, Word, Excel etc.)

This section guides you through each step of indexing and searching media file contents.

Prerequisites

To search attachment contents, the index table engine is required. Before you start this tutorial, I highly suggest reading the introduction to the index table.

Step 1 - Index table configuration

Open up the Index table submenu, located under the Ajax Search Pro main menu.

Choosing the Attachment post type to index

On the General panel, under the Post types to index option, choose the Attachment - Media post type. This unlocks the Media Service and File indexing options.

Registering a Media Service license key (Free version available!) will enable sending the files to an external server for the best and most accurate file indexing.

Once you have a license key, you can follow this documentation on how to enable it, although it is fairly simple: put the key into the input field and hit the Activate button. Once it is activated, the plugin will automatically attempt to index all selected file types via this parser. That's it!

What happens if I don't want to use the Media Parser?

No worries! In that case the local, built-in file parsers are used. In most cases they do the job, but they are much less efficient. Please check the Local Parsers section below for more information.

Choosing the file mime types to index

Each attachment has a so-called mime type. The mime type tells the system what kind of file it is dealing with (for example, application/pdf for PDF documents).
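To make the idea of a mime type more concrete, here is a minimal Python sketch, not part of the plugin. It assumes a small hand-written lookup table (`EXAMPLE_MIME_TYPES`) and two hypothetical helpers (`guess_mime_type`, `should_index`) that mimic how a file extension maps to a mime type, and how a set of selected types could decide whether a file gets indexed. The plugin's own list and logic may differ.

```python
from pathlib import PurePath
from typing import Optional, Set

# Hand-written table for illustration only; this is NOT the plugin's own list.
EXAMPLE_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    ".odt": "application/vnd.oasis.opendocument.text",
    ".rtf": "application/rtf",
    ".txt": "text/plain",
    ".csv": "text/csv",
}


def guess_mime_type(filename: str) -> Optional[str]:
    """Return the mime type for a file name, or None if the extension is unknown."""
    extension = PurePath(filename).suffix.lower()
    return EXAMPLE_MIME_TYPES.get(extension)


def should_index(filename: str, selected_types: Set[str]) -> bool:
    """Hypothetical check: index a file only if its mime type was selected."""
    mime_type = guess_mime_type(filename)
    return mime_type is not None and mime_type in selected_types


if __name__ == "__main__":
    selected = {"application/pdf", "text/plain"}
    for name in ("Manual.PDF", "notes.txt", "budget.xlsx", "photo.jpg"):
        print(name, guess_mime_type(name), should_index(name, selected))
```

In this sketch, only files whose mime type is in the selected set would be passed on to a parser. Everything else is skipped.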

To choose the types, simply scroll down to the File indexing options section, and choose the file types you wish to index.

Entering the mime types manually

If you wish, you can switch to manual mode by clicking the >>Enter Manually<< link.

Here you can find the list of supported mime types.

Enabling file content indexing

After entering the desired mime types, the file content indexing options will unlock (based on which mime types are entered).

Click on the On/Off buttons to choose which file types should have their contents indexed.

Save and Index

After choosing all the desired options, it is time to Save the configuration at the bottom of the page, and then generate the index.

Step 2 - Search instance configuration

We are almost done. Now the desired search instance needs to be configured to use the index table for attachments. If you have not created a search instance yet, make sure to do that first.

In the search instance options, go to the Search Sources -> Media Files Search panel. Then change the first two options:

  • Return media files as results: ON

  • Search engine for media results: Index table engine

Save the options, and it is done. The search should return attachments based on their content now.

Local Parsers

When the Media Service is not enabled, the local file parsers are used. Because these have to be executed on your local server, they depend on the local server performance as well, and they are generally less accurate and less efficient.

Minimum requirements & supported formats

Only some of the parser scripts require standard libraries to be installed and enabled. These modules are enabled by default on most server hosts.

  • For Microsoft Office and Open Office documents - ZipArchive and php-xml PHP modules (enabled on most hosts). Supported file types: .docx, .xlsx, .pptx, .odt, .ott, .odm, .ods, .odp

Older MS Office 97-2003 file formats may not work correctly! These include .xls, .doc and .ppt.

Indexing other document types (RTF, TXT, CSV etc.) is still possible without meeting these requirements.
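The ZipArchive and XML requirement makes sense once you know how these formats are stored: a .docx file is a ZIP archive, and its main text lives in an XML file inside it (word/document.xml). The Python sketch below illustrates that idea only. It is not the plugin's PHP parser, and the helper names (`extract_docx_text`, `make_sample_docx`) are made up for this example. A tiny docx-like archive is built in memory so the example runs without any real document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements inside word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def extract_docx_text(data: bytes) -> str:
    """Unzip the archive, parse the main XML part, and join the text per paragraph."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; each run keeps its text in <w:t> nodes.
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return "\n".join(paragraphs)


def make_sample_docx() -> bytes:
    """Build a minimal docx-like ZIP in memory (illustration only, not a full document)."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


if __name__ == "__main__":
    print(extract_docx_text(make_sample_docx()))
```

The older .doc, .xls and .ppt formats are binary files rather than zipped XML, so this approach does not apply to them.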

Possible limitations - fair use

The local parser libraries are highly optimized, and their performance mostly depends on the actual server performance. However, on an average server there are a few things to consider that may greatly affect performance:

  • Document length - documents over 30-60 pages can get very difficult to index, and may fail, especially PDF files. Therefore it is not recommended to use this feature to index long books/documents.

  • File size - documents with large images/attachments can be difficult and costly to read from the server's perspective. Optimally, the document should only contain the text to be indexed, although some graphics should not be an issue at all.

  • Secured or password-protected documents - these cannot be parsed.
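If you want to review your media library before relying on the local parsers, the points above can be turned into a simple checklist. The Python sketch below is only an illustration and not something the plugin provides. `DocumentInfo` and `local_parsing_warnings` are hypothetical names, the page threshold takes the lower end of the 30-60 page range mentioned above, and the size threshold is an arbitrary assumption.

```python
from dataclasses import dataclass
from typing import List, Optional

# Lower end of the 30-60 page range mentioned in the list above.
PAGE_WARNING_THRESHOLD = 30
# Arbitrary assumption for illustration; the plugin does not document a size limit.
SIZE_WARNING_BYTES = 10 * 1024 * 1024


@dataclass
class DocumentInfo:
    """Hypothetical description of a document, filled in by you."""
    name: str
    size_bytes: int
    page_count: Optional[int] = None  # None when the page count is unknown
    password_protected: bool = False


def local_parsing_warnings(doc: DocumentInfo) -> List[str]:
    """Return the reasons why local parsing of this document may be problematic."""
    warnings = []
    if doc.password_protected:
        warnings.append("password-protected: cannot be parsed")
    if doc.page_count is not None and doc.page_count > PAGE_WARNING_THRESHOLD:
        warnings.append("long document: indexing may be slow or fail")
    if doc.size_bytes > SIZE_WARNING_BYTES:
        warnings.append("large file: costly for the server to read")
    return warnings


if __name__ == "__main__":
    book = DocumentInfo("handbook.pdf", size_bytes=25 * 1024 * 1024, page_count=240)
    for warning in local_parsing_warnings(book):
        print(f"{book.name}: {warning}")
```

A document that produces no warnings is a reasonable candidate for the local parsers.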

Use the Media Service feature - it is much more efficient, and you don't have to worry about anything. It will index all documents as accurately as possible.

Copyright Ernest Marcinko