For the Title and Pay Plan project, I made extensive use of the node_import module. Here are a few of the challenges I had with getting data imported.
Background
My data model uses CCK fields and node reference to relate a parent record with one or more child records. For example a Step Based position may have up to 37 steps for hourly, monthly and annual salary data. Our data comes from the mainframe-based Payroll Personnel System (PPS). Data is extracted into CSV files for the various position types (step, open, grade, msp). The data includes Title, Title Code, Bargaining Unit, Salary Effective Date, Shift rates, Weekend rates, etc.
So, I created content types for the parent and children. I created CCK fields to hold the data. I was using the Date module for all the date data types. I used nodereference to link the parent and child records via nid.
Data Preparation
The application is desiged to allow for dump and re-importing as necessary. My process for taking the extracted data and preparing it for import goes like this.
- Add column Position Type and fill with grade_based
- Edit column headers in CSV file to reflect:
- Title Code, Title, Title, Bargaining Unit, FLSA, Grade Salary Effective Date, Position Type
- Check the file for extra spacing using the Mac Excel Expander > Killers > Kill Irregular Spacing. This will kill the extra padding found in the Title column.
- Dates need to be in the format of m/d/Y - hh:mm:ss so that the node_import script interprets them correctly. This can be done in Excel or in TextMate.
- Save
- Open file in TextMate. Save as to make sure we've got LF selected for the line feeds. Do the search and replace on dates if necessary.
- Use node_import via Drupal to pull the contents of the file into the Title content type
- Run through the seven steps of the import script.
- Resolve any errors and keep re-running till it's 100% correct.
Known Issues
Node import may have this fix in place by the time you read this, but read through the long thread at http://drupal.org/node/374346. The basic message is that you'll need to add date.inc in the appropriate sub-directory of the node_import directory. I ended up doing this more than once through an upgrade and once when I moved the site to production.
Importing CCK Date fields is problematic. It tended to throw lots of errors and caused me to try many different date formats to get the data inserted properly. Here's my big aha moment. My application didn't need hours, minutes and seconds, but I had selected that granularity in the CCK manage fields options. My other ahamoment was that if I formatted dates in Excel to YYYY-MM-DD and used the Custom option in node_import with a value of 'Y-m-d' my date problems went away.