Thursday, December 6, 2007

Importing Content into Drupal

The Node Import project is a useful Drupal module for importing data into your Drupal site. It supports importing content from delimited text files, both comma-separated (CSV) and tab-separated.

My first impressions are mixed: it does manage to import nodes (primarily into my custom-defined content types), but for some strange reason the CCK fields that reference other nodes do not get populated the way I specified in the import wizard.

In addition, when I try to upload a large CSV file, the first screen (where the file is selected) just refreshes with a red warning that a file must be selected for import, without any hint as to what may be wrong with the CSV file. When that happens, try breaking your CSV down into smaller chunks, as there may be a badly formatted entry in the file that breaks the whole thing.
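One way to do the chunking is with a small script rather than by hand. The following is a minimal sketch in Python, not part of Node Import; the function name `split_csv` and the chunk size are my own assumptions. It uses a real CSV parser so that quoted fields containing line breaks stay in one piece, and it repeats the header row in every chunk so each one can be imported on its own.

```python
import csv
import io


def split_csv(text, rows_per_chunk):
    """Split CSV text into a list of smaller CSV texts.

    Hypothetical helper: every chunk repeats the header row and holds
    at most rows_per_chunk data rows.
    """
    # newline="" leaves line endings alone so the csv module can handle
    # line breaks inside quoted fields itself.
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)

    chunks = []
    current = []
    for row in reader:
        current.append(row)
        if len(current) == rows_per_chunk:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)

    # Serialise each group of rows back to CSV text, header first.
    output = []
    for rows in chunks:
        buffer = io.StringIO(newline="")
        writer = csv.writer(buffer)
        writer.writerow(header)
        writer.writerows(rows)
        output.append(buffer.getvalue())
    return output
```

To use it on a real export, read the file with `open(path, newline="")`, pass its contents to `split_csv`, and write each returned string to its own file. Importing the chunks one at a time also narrows down which part of the data contains the offending entry.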

While on the topic of CSV: the file that I'm using is generated with a very useful CSV tag library by the good people at RedBalloon Lab, as I was moving things through a ColdFusion file. However, note that it's not perfect when you're dealing with exports of large text fields that contain HTML markup. I have not yet figured out what exactly gets messed up, but every few hundred entries I get an entry that completely breaks the file.
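A quick way to hunt for such entries is to parse the file and report every row whose number of fields differs from the header. This is only a sketch under that assumption (a wrong column count is one common symptom of a broken row, not necessarily the cause here), and `find_bad_rows` is a hypothetical name of mine.

```python
import csv
import io


def find_bad_rows(text):
    """Return (line_number, field_count) for rows that do not match the header.

    line_number is the physical line in the text on which the row ends.
    """
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    bad = []
    for row in reader:
        # The csv module returns an empty list for blank lines; skip those.
        if row and len(row) != len(header):
            bad.append((reader.line_num, len(row)))
    return bad
```

For example, `find_bad_rows("a,b\n1,2\n3,4,5\n6,7\n")` reports the third line, which has three fields where the header has two.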

Final notes on the importing process into Drupal: CCK node references do not get populated, even though they're supposed to. When I manually edited the CCK table content_field_yourfieldname, which stores the node reference ID, I was further surprised that the front end still did not show the updated data (the author name was still blank after I had updated the node ID that refers to the author). It turns out that the caching mechanism plays a role: deleting the contents of the cache_content table forces Drupal to show the current data.

[Updated 11 Apr 2008]
A few months later I'm back to importing content. I just figured out an important aspect of importing data: you'll probably need to raise the script execution time limit, because the import takes some time to process the files, especially if you have lots of data.

One way to adjust your global setting of how long a PHP script can run is to edit the php.ini file and change the following line:

;;;;;;;;;;;;;;;;;;;
; Resource Limits ;
;;;;;;;;;;;;;;;;;;;

max_execution_time = 120 ; Maximum execution time of each script, in seconds

By default it's set to 30 seconds (evidenced by my node import showing a blank white screen once that time had passed). So I suppose you'd want to change that value only while you're working on the import and then put it back.

Now that this is demystified, I recall seeing my Drupal modules page go blank once in a while, and this was probably the reason for it.

1 comment:

sjusic said...

As I was generating CSV files and viewing them in Excel, I noticed that they'd get really messed up every few hundred rows. I thought that the export library was not doing something right, but later figured out that there's a maximum amount of text that can be displayed in an Excel cell. When a field reached that limit, Excel started breaking it badly into other cells, causing a major spillover of the exported data.

This, however, means that you should in theory be able to import things from the original data file, ignoring the display problem in Excel (so all is good with the export lib).
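To confirm that a messy-looking row is only an Excel display problem, you can list the fields that are longer than a cell can hold. The sketch below assumes Excel's documented limit of 32,767 characters per cell; the function name `oversized_cells` is hypothetical.

```python
import csv
import io

EXCEL_CELL_LIMIT = 32767  # assumed: Excel's documented per-cell character limit


def oversized_cells(text, limit=EXCEL_CELL_LIMIT):
    """Return (row, column, length) for every field longer than limit.

    Rows and columns are counted from 1, with the header as row 1.
    """
    # The csv module rejects very large fields by default, so raise its
    # own limit first (this setting is global to the csv module).
    csv.field_size_limit(max(csv.field_size_limit(), 10_000_000))

    reader = csv.reader(io.StringIO(text, newline=""))
    hits = []
    for row_index, row in enumerate(reader, start=1):
        for col_index, value in enumerate(row, start=1):
            if len(value) > limit:
                hits.append((row_index, col_index, len(value)))
    return hits
```

If the rows that look broken in Excel all show up in this list while the file otherwise parses cleanly, the export itself is most likely fine.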