Bulk Imports: Do You Report Missing Data?

Hi,

For those of you who have built bulk-import functionality into applications: do you normally report non-mandatory but important missing data to users? If so, do you report the specific fields that are missing data, or do you simply query the database afterwards for the records that are missing important data?

I have built an incomplete-data filter into my import functionality that inserts each missing data column per imported user into a flag_this_new_incomplete_data table. My concern is that using this filter causes the import time to triple; the time is not prohibitive, just not as responsive as without the filter. The filter checks each row of imported data after each column is inserted into its respective relational table, like so:


        // Flag the lead if it was imported without a group assignment.
        $this->o_LeadManage->group_id = $this->checkType($this->group_id);
        if(empty($this->o_LeadManage->group_id)){
          $primary_incomplete_data = 'Lead: ' . $this->first_name . ' ' . $this->last_name . ' group';
          $this->incompleteData('Should this lead be assigned to a group(s)?', $primary_incomplete_data, array('group'));
        }

And the method:


  protected function incompleteData($name, $primary_data, $incomplete_column_names){
      if(!is_array($incomplete_column_names)){
        throw new Exception('Third argument needs to be an array, ' . gettype($incomplete_column_names) . ' given.');
      }
      if(empty($primary_data)){
        throw new Exception('Second argument was empty; you need to supply a string. Typically 
            this is the most important data in a database row. For example, if the table were 
            postal_codes then a good choice of primary data would be the postal_code value.');
      }
      $col_count = count($incomplete_column_names);
      if($col_count > 30){
        throw new Exception('You have exceeded 30 incomplete columns. The maximum allowed is 30.');
      }
      $sql="
        INSERT INTO flag_this_new_incomplete_data
        (
          primary_data_field
          ,name
          ,col1 
          ,col2 
          ,col3 
          ,col4 
          ,col5 
          ,col6 
          ,col7 
          ,col8 
          ,col9 
          ,col10 
          ,col11 
          ,col12 
          ,col13 
          ,col14 
          ,col15 
          ,col16 
          ,col17 
          ,col18 
          ,col19 
          ,col20 
          ,col21 
          ,col22 
          ,col23 
          ,col24 
          ,col25 
          ,col26 
          ,col27 
          ,col28 
          ,col29 
          ,col30
        ) VALUES ( 
          :primary_data
          , :name
          
      ";
      for($i = 0; $i <= 29; $i++){
        if(!empty($incomplete_column_names[$i])){
          // Use a placeholder rather than concatenating the value directly
          // into the SQL, so the column names can be bound safely below.
          $sql .= ", :col$i
          ";
        } else {
          $sql .= ", NULL
          ";
        }
      }
      $sql .= " ) ON DUPLICATE KEY UPDATE primary_data_field = :primary_data";
      $stmt = $this->o_Db->prepare($sql);
      $stmt->bindParam(':primary_data', $primary_data);
      $stmt->bindParam(':name', $name);
      for($i = 0; $i <= 29; $i++){
        if(!empty($incomplete_column_names[$i])){
          $stmt->bindValue(":col$i", $incomplete_column_names[$i]);
        }
      }
      return $stmt->execute();
    }

This generates results like:

User: Gloria Jones, Generally for mail, email and phone, first name is important. Should this lead have their first name?
User: Gordon Mathes, For reporting purposes, is having this lead's gender set important?

The alternative to this approach is to rely on running a SQL SELECT to find all records that are missing x, y, z. This approach does not lend itself to the more detailed ‘missing data’ results that the current filter produces; however, it is faster.
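The SELECT-based alternative might look something like the following sketch. The table, column names, and messages are assumptions for illustration, and an in-memory SQLite database stands in for the real server:

```php
<?php
// Sketch of the SELECT-based approach: one query over the just-imported
// records flags which important (but optional) fields are empty.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE leads (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, gender TEXT)");
$db->exec("INSERT INTO leads (first_name, last_name, gender) VALUES
           ('Gloria', 'Jones', NULL),
           (NULL, 'Mathes', 'M')");

// Boolean expressions in the SELECT list tell us *which* field is missing,
// not just that the row is incomplete.
$sql = "SELECT id, last_name,
               (first_name IS NULL OR first_name = '') AS missing_first_name,
               (gender IS NULL OR gender = '')          AS missing_gender
        FROM leads
        WHERE first_name IS NULL OR first_name = ''
           OR gender IS NULL OR gender = ''";
$incomplete = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);

foreach ($incomplete as $row) {
    echo "Lead {$row['last_name']}:"
       . ($row['missing_first_name'] ? ' first name missing' : '')
       . ($row['missing_gender'] ? ' gender missing' : '') . "\n";
}
```

One query replaces one INSERT per missing field, which is where the speed difference comes from.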

The other alternative is to not report any missing data at all, and simply update each user later once it is decided that they are missing information.

Your thoughts on how you handle this type of information would be appreciated.

Steve

Assign a unique identifier, based on the date/time, to the table that you are importing to. Once the upload is complete, run your query against the data that was just imported and present the user a list of exceptions, rather than rolling it into your PHP.

Hi,

Thanks for your idea… However, the data gets imported regardless of whether some of the fields are missing data. Only a few fields have mandatory requirements, and if they are not filled, exceptions are shown. The data I am talking about is more so that administrators can tell (after an import) what data is missing, as the information they import comes from different sources.

What I am interested in is:

  1. When faced with this situation, where data is not mandatory but is still important, do you report specifically what is missing or not?
  2. What techniques do others use when faced with this? (You showed one approach; thanks.)

Regards,
Steve

  1. Either show a useful warning, or don’t show a warning at all, but don’t show a generic warning “this and that contains one or more empty fields” kind of thing; that doesn’t help anyone. If anything it only makes people mad (“What does that even MEAN!?”).

  2. I’ve only done imports a few times, but usually I just ran the whole file and kept track of the current line number. When a warning or an error was encountered I’d store it in the appropriate array ($warnings / $errors), and when the import completed I showed these $warnings and $errors. The main advantage is that the complete file is imported even if you hit an error early on. It is then up to the user to fix the broken rows and re-import. Using duplicate detection you can then skip lines that were already there, or update them if data has been added.
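A minimal sketch of that line-number plus $warnings/$errors approach, with made-up field names and rules purely for illustration:

```php
<?php
// Hypothetical rows as they would come out of the CSV parser. Here 'email'
// is treated as mandatory and 'first_name' as optional-but-important.
$rows = array(
    array('first_name' => 'Gloria', 'last_name' => 'Jones',  'email' => 'gj@example.com'),
    array('first_name' => '',       'last_name' => 'Mathes', 'email' => 'gm@example.com'),
    array('first_name' => 'Ann',    'last_name' => 'Lee',    'email' => ''),
);

$warnings = array();
$errors   = array();

foreach ($rows as $i => $row) {
    $line = $i + 1; // keep track of the current line number

    if (empty($row['email'])) {
        // Mandatory field missing: record an error, skip the row,
        // but keep importing the rest of the file.
        $errors[] = "Line $line: email is required.";
        continue;
    }
    if (empty($row['first_name'])) {
        // Optional-but-important field missing: record a warning only.
        $warnings[] = "Line $line ({$row['last_name']}): first name is missing.";
    }
    // ... insert the row here ...
}
// After the loop, show $warnings and $errors to the user in one go.
```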

Thanks ScallioXTX,

You have confirmed my thinking. I will accept the little extra time (30 seconds to import 3000 records versus 10 seconds without the filter) and continue to display a more meaningful message.

I have done what you suggest in #2, where I import data into a temporary table, then loop through that temp table, perform manipulation, insert relational data, and so on. The missing-data report is more of a flag to whoever does a bulk import that some data is missing. If a person adds a user through the manual create-user page, they will typically complete more user info than is necessarily the case with the bulk import data. Normally no more than a few hundred users would be imported at one time, so the time may not even be noticed by the end users; however, there is something to be said for thinking about operations (such as bulk import) that could cripple performance if a large number of records were imported; this is the crux of why I am worrying about the performance hit :slight_smile:

Best regards,
Steve

I’m sorry, I think you’re missing my point. You should be querying the record set that was just imported for these missing fields, not checking during import with PHP. If he/she wants to re-upload, wipe the data set clean and do it again.

Select anything that was just uploaded in that session and, where it meets whatever x, y or z criteria you see fit, display “warnings” to the user with the option to re-upload.
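One way to scope the query to "that session" is to tag every row of an upload with a batch identifier. The schema and column names below are assumptions, with in-memory SQLite standing in for the real database:

```php
<?php
// Sketch: tag each upload with a batch id, then query only that batch
// for exceptions once the upload completes.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE import_temp (id INTEGER PRIMARY KEY, import_batch TEXT, first_name TEXT, gender TEXT)");

// Unique identifier per upload, based on date/time as suggested above.
$batch = date('YmdHis') . '-' . uniqid();

$stmt = $db->prepare("INSERT INTO import_temp (import_batch, first_name, gender) VALUES (?, ?, ?)");
$stmt->execute(array($batch, 'Gloria', null));
$stmt->execute(array($batch, null, 'M'));

// After the upload: find this batch's rows that miss important fields.
$q = $db->prepare("SELECT COUNT(*) FROM import_temp
                   WHERE import_batch = ?
                     AND (first_name IS NULL OR gender IS NULL)");
$q->execute(array($batch));
$exceptions = (int) $q->fetchColumn();
```

Earlier uploads keep their own batch ids, so re-uploading or wiping one data set never touches another.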

Gave this a whirl and it made little difference to the times I showed, as I’m having to loop through the imported results anyway, so the approach I use works OK.

Thanks for your suggestion.

Steve

Isn’t the crux of this question really “what do I want to happen if patchy data is bulk loaded?”.

Depending on how much value is gained from prompting either the user, or an administrator I suppose the responses could vary between being:

  1. dump the whole thing, and let me try again (delete the temp table)
  2. dump the bad records, and prompt me for re-entry of those bad records (do a partial commit)
  3. I don’t care, that’s all you are getting (commit it anyway)

If that is the scenario, then maybe a temporary table would be helpful.
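The three outcomes above could be sketched roughly like this. Table and column names are hypothetical, and an in-memory SQLite database stands in for the real one:

```php
<?php
// Rows are staged in import_temp first; a chosen strategy then decides
// what reaches the real leads table.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE import_temp (first_name TEXT, gender TEXT)");
$db->exec("CREATE TABLE leads (first_name TEXT, gender TEXT)");
$db->exec("INSERT INTO import_temp VALUES ('Gloria', NULL), ('Gordon', 'M')");

$strategy = 'partial_commit'; // or 'dump_all' / 'commit_anyway'

switch ($strategy) {
    case 'dump_all':
        // 1. dump the whole thing and let the user try again
        $db->exec("DELETE FROM import_temp");
        break;
    case 'partial_commit':
        // 2. commit only the complete rows; the bad ones stay staged
        //    so the user can be prompted to re-enter them
        $db->exec("INSERT INTO leads
                   SELECT first_name, gender FROM import_temp
                   WHERE first_name IS NOT NULL AND gender IS NOT NULL");
        break;
    case 'commit_anyway':
        // 3. "that's all you are getting": commit everything, gaps included
        $db->exec("INSERT INTO leads SELECT first_name, gender FROM import_temp");
        break;
}

$committed = (int) $db->query("SELECT COUNT(*) FROM leads")->fetchColumn();
```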

Thanks Cups,

Yup :slight_smile: (said in so many fewer words nice :smiley: )

Currently,

  1. Use SQL to import a CSV file into a temporary table (unfiltered, with no WHERE condition).
  2. Loop through each row of the temporary table, extract the data, check whether it already exists, and if not, INSERT it.
  3. Some of the inserted data will have missing fields like ‘First Name’, ‘Groups’, ‘Enrollment Number’… None of these are mandatory, but the person importing the data is not the same person who prepared the CSV, so they may want to add data such as the ‘First Name’ and maybe an ‘Enrollment Number’.
  4. It is in this loop, as each row is being processed, that I report the missing data fields into a table that I can SELECT from and GROUP all missing data at the end of the import. It is inserting each of these missing-data rows that causes the extra delay.
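The steps above can be sketched as follows. In MySQL, step 1 would typically be LOAD DATA INFILE; here fgetcsv on an in-memory stream stands in so the example is self-contained, and the column names are assumptions:

```php
<?php
// Tiny stand-in CSV: two data rows, each missing something.
$csv = "first_name,groups,enrollment_number\nGloria,,\n,Staff,1234\n";
$fh = fopen('php://memory', 'r+');
fwrite($fh, $csv);
rewind($fh);

$header = fgetcsv($fh);
$missing_report = array(); // stands in for the flag_this_new_incomplete_data table
$line = 1;

// Step 2: loop through each staged row.
while (($data = fgetcsv($fh)) !== false) {
    $line++;
    $row = array_combine($header, $data);
    // ... check whether the row already exists and INSERT it if not ...

    // Step 4: record which optional fields are empty for this row,
    // so they can be SELECTed and GROUPed at the end of the import.
    foreach ($row as $col => $value) {
        if ($value === '' || $value === null) {
            $missing_report[] = "Line $line: $col is empty";
        }
    }
}
fclose($fh);
```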

I am going to use your option 3 with a slight modification. I will INSERT the data regardless and will keep the missing data reported for each row.

In cases where users are, in many ways, blind to what they will be getting in a bulk import, the more information provided, the better they can decide whether important things should be corrected.

Steve

There should be no reason to loop through with PHP other than to print the results in table format on the page. Use SQL to do your error checking.

So if each row of the CSV has data related to one user, and that data belongs to approximately 60 different tables in the database, are you going to use just SQL to search the temporary table where all the raw data was imported, INSERT the data into the proper tables, and somehow ensure that the data relating to each user stays relational to the correct user? And on top of this, do your checks for missing data (in my case these are not errors, as the data is optional) using SQL, without any loops? I’m interested in how you propose to do this using just SQL.

I didn’t see that you’ll be appending to different tables, but this matters little. You’ll simply append to a temp table, perhaps with the destination table name in a new column. Use separate queries if necessary, then pile them on top of each other using UNION ALL to display your warnings to the front end.
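A small sketch of that UNION ALL idea: one query per destination table, stacked into a single warning list. Table and column names are assumptions, with in-memory SQLite standing in for the real server:

```php
<?php
// Two staging tables, each potentially missing a different field.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec("CREATE TABLE temp_names  (user_id INTEGER, first_name TEXT)");
$db->exec("CREATE TABLE temp_groups (user_id INTEGER, group_id INTEGER)");
$db->exec("INSERT INTO temp_names  VALUES (1, NULL), (2, 'Gordon')");
$db->exec("INSERT INTO temp_groups VALUES (1, 7), (2, NULL)");

// Separate per-table checks, piled together with UNION ALL so the front
// end receives one row per (user, missing field).
$sql = "
    SELECT user_id, 'first_name' AS missing_field FROM temp_names  WHERE first_name IS NULL
    UNION ALL
    SELECT user_id, 'group'      AS missing_field FROM temp_groups WHERE group_id   IS NULL
";
$warnings = $db->query($sql)->fetchAll(PDO::FETCH_ASSOC);
```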

Can you supply some examples of different “exceptions” that you would want a warning thrown for?

Hi

These are the types of warnings:

User: Gloria Jones, Generally for mail, email and phone, first name is important. Should this user have their first name?
User: Gordon Mathes, For reporting purposes, is having this user's gender set important?

Well, I was hoping for some data sets, but this looks very straightforward for SQL. It doesn’t really matter what you return; once you have your record numbers and user names (or whatever would qualify for a warning), use PHP to “present” the data rather than do the work of finding it.
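The "SQL finds it, PHP presents it" split could look like this. The result rows and message texts are made up to match the examples earlier in the thread:

```php
<?php
// $incomplete stands in for a query result: one row per (user, missing field).
$incomplete = array(
    array('name' => 'Gloria Jones',  'missing' => 'first_name'),
    array('name' => 'Gordon Mathes', 'missing' => 'gender'),
);

// PHP's only job is formatting: map each missing field to a friendly message.
$messages = array(
    'first_name' => 'Generally for mail, email and phone, first name is important. Should this user have their first name?',
    'gender'     => "For reporting purposes, is having this user's gender set important?",
);

$warnings = array();
foreach ($incomplete as $row) {
    $warnings[] = "User: {$row['name']}, {$messages[$row['missing']]}";
}
```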