Thursday, April 3, 2008

What is the Verity K2 server doing?

I've recently been plagued by errors related to full-text search and indexing in a ColdFusion-based application that uses Verity K2 for searching. Indexing times have gone through the roof; in fact, the indexing runs never really finish at all, which did not happen before.

So, anyway, some digging turned up the following interesting things...

Every time a new CFINDEX is issued, you can find out what K2 is doing by looking at {cfroot-install}/verity/Data/services/{unique-indexserver-name}/log/status.log. Of interest to me was:
Status: [VDKCB ws=cf_jrpp-99_workspace] Initializing dataset 00000048.ddd, index 00000048.did
Status: [VDKCB ws=cf_jrpp-99_workspace] Totals (1000 documents): 5000 para 2000 sent 352349 word (4464 Kb used)
Status: [VDKCB ws=cf_jrpp-99_workspace] (20719 ms) Indexed 1000 docs into {PATH}/{COLLECTION-NAME}/parts/00000048
Status: [VDKCB ws=cf_jrpp-99_workspace] Writing partition index data
I was quite surprised to see that the index time was about 20 seconds for 1000 documents (I'm getting the data from a database). However, my CF page that drives the indexing just sits there waiting and never gets past the CFINDEX tag!
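To keep an eye on indexing speed without reading the log by hand, here is a minimal Python sketch that pulls the timing out of "Indexed N docs" lines. It assumes the lines look exactly like the one quoted above; the regular expression and the function name indexing_rates are my own for illustration, not anything Verity provides.

```python
import re

# Pattern modeled on the "(20719 ms) Indexed 1000 docs into ..." line quoted
# above. Other K2 versions may format this line differently (assumption).
INDEXED_RE = re.compile(r"\((\d+) ms\) Indexed (\d+) docs")


def indexing_rates(lines):
    """Yield (docs, milliseconds, docs_per_second) for each 'Indexed' line."""
    for line in lines:
        match = INDEXED_RE.search(line)
        if match:
            ms, docs = int(match.group(1)), int(match.group(2))
            # Guard against a zero-duration entry to avoid dividing by zero.
            rate = docs / (ms / 1000.0) if ms else float("inf")
            yield docs, ms, rate


# Sample lines copied from the status.log excerpt above.
sample = [
    "Status: [VDKCB ws=cf_jrpp-99_workspace] Initializing dataset 00000048.ddd, index 00000048.did",
    "Status: [VDKCB ws=cf_jrpp-99_workspace] (20719 ms) Indexed 1000 docs into {PATH}/{COLLECTION-NAME}/parts/00000048",
]

for docs, ms, rate in indexing_rates(sample):
    print(f"{docs} docs in {ms} ms = {rate:.1f} docs/sec")
```

To run it against a real log, pass an open file handle for status.log instead of the sample list.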

To make things more interesting, I tried a direct, explicit CFSEARCH with CFKEY=xxx and got results back from the collection, which proves that the element I was indexing was inside the collection.

A few more findings: it appears that the "...workspace...." mentioned in the log output above refers to a file in the {cfroot-install}/verity/Data/services/{unique-indexserver-name}/tmp folder. In my case K2 was working on "cf_jrpp-99_workspace", and I had a cf_jrpp-99_workspace_BIF file in that tmp folder. If you look into that file, it looks like a complete list of all the documents it was given to work on.

Now I'm wondering what might be keeping my pages from returning when the CFINDEX finishes.

This was with CFMX 7.0.1, with a Unicode Verity collection.

I also tried indexing a larger set of documents (5000) and got:
Status: K2 Index Server: Waiting for available Async slot
No idea whether something went wrong somewhere, but my documents are not inside the collection yet...

Further digging around the file system in the verity/Data/services/ws folder shows a separate folder for each workspace item, so you get "cf_jrpp-88_workspace" and a million other folders with similar names. I wondered who cleans that up. Looks like no one does.

After removing all the workspace folders in the ws/ directory, indexing started working much better: no issues handling a 5000-document indexing attempt, which took only 100 seconds, and 10,000 documents went through in about 230 seconds.
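Since nothing seems to clean the ws/ directory automatically, the cleanup could be scripted. Below is a rough Python sketch under a few assumptions of mine: the folders to remove all end in "_workspace" (as in the "cf_jrpp-88_workspace" example), and you pass in the path to your own ws/ directory. The function names are hypothetical, and it defaults to a dry run that only lists what it would delete. I have not checked whether it is safe to delete these folders while K2 is running, so stopping the K2 services first seems prudent.

```python
import shutil
from pathlib import Path


def find_workspaces(ws_dir, suffix="_workspace"):
    """Return workspace folders directly under ws_dir, sorted by name.

    The "_workspace" suffix is an assumption based on the folder names
    seen in the ws/ directory.
    """
    root = Path(ws_dir)
    return sorted(
        p for p in root.iterdir() if p.is_dir() and p.name.endswith(suffix)
    )


def clean_workspaces(ws_dir, dry_run=True):
    """Remove workspace folders; with dry_run=True, only report them.

    Returns the names of the folders that were (or would be) removed.
    """
    names = []
    for folder in find_workspaces(ws_dir):
        if not dry_run:
            shutil.rmtree(folder)
        names.append(folder.name)
    return names
```

Calling clean_workspaces(path_to_ws) first and reviewing the returned names before repeating the call with dry_run=False keeps accidental deletions unlikely.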

Updated 21 Apr 2008:
After further testing, it turns out that your existing collections may become more or less unusable with K2 after upgrading CFMX to 7.0.2. Basically, a collection that still works fine for searching does not respond well to attempts to delete records from it. It was actually reproducible: create a collection with CFMX 7.0, index some number of documents (I had 200,000+ and 400,000+), upgrade CFMX to 7.0.2, and attempt to delete a few indexed documents. You'll get either the above-mentioned "K2 Index Server: Waiting for available Async slot" message in /data/services/coldfusionk2_indexserviceN/log/status.log, or the status log will just register a new request to delete things and nothing happens after that.

So, the easy way out is simply to re-create and re-index your collections.
