Getting delimiter from a line

Not sure if you got my point: we will return only one delimiter, based on the following cases:

Case 1:
If the count of ; is greater than the count of , we return ; as the delimiter.

Case 2:
If the count of ; equals the count of , we return , since it was the one found first.
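
For reference, the two cases above could be sketched roughly like this. It is only an illustration: pickDelimiter() and its candidate list are hypothetical names, not part of any code posted in this thread, and it ignores quoted fields.

```php
<?php
// Hypothetical helper: pick one delimiter for a single line.
// The most frequent candidate wins; a tie goes to whichever appears first.
function pickDelimiter(string $line, array $candidates = [',', ';']): ?string
{
    $best = null;
    $bestCount = 0;
    $bestPos = PHP_INT_MAX;

    foreach ($candidates as $candidate) {
        $count = substr_count($line, $candidate);
        if ($count === 0) {
            continue; // this candidate does not occur at all
        }
        $pos = strpos($line, $candidate); // position of its first occurrence
        if ($count > $bestCount || ($count === $bestCount && $pos < $bestPos)) {
            $best = $candidate;
            $bestCount = $count;
            $bestPos = $pos;
        }
    }

    return $best; // null when no candidate occurs in the line
}

var_dump(pickDelimiter('a;b;c,d')); // ";" occurs twice, "," once, so ";" wins
var_dump(pickDelimiter('a,b;c'));   // tie, and "," appears first, so "," wins
```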

Does this make sense to you?

Thanks

Yes, I get your point, but you have a file that has 2 delimiters. If you only take one and use it, you will still end up with either an error parsing that CSV or bad data, because you can’t account for the second delimiter that exists.

So in your example, the first set of values you might receive using ; (because it occurs more frequently) is

abc,111\ndef; 111

The second set of data would be

ijk; 222

All that is based on the assumption PHP can handle it that way and doesn’t return FALSE designating an error due to the comma delimiter.
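
To check that assumption, here is a small sketch that runs the three sample lines through str_getcsv with ; as the delimiter, one line at a time. As far as I can tell it does not signal an error; the comma line simply comes back as a single field. Parsing line by line is my choice for the illustration and may differ from how your file is actually read.

```php
<?php
// Parse each sample line with ';' as the delimiter.
// The enclosure and escape arguments are passed explicitly.
$lines = ['abc,111', 'def; 111', 'ijk; 222'];
$rows = [];
foreach ($lines as $line) {
    $rows[] = str_getcsv($line, ';', '"', '\\');
}

// $rows[0] holds one field, 'abc,111' (the comma is treated as ordinary text)
// $rows[1] holds two fields, 'def' and ' 111' (the leading space is kept)
print_r($rows);
```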

Just in case you still insist on not throwing an error when multiple delimiters are found, here is a working version:

csv.php

<?php
class CSV
{
	private $filePath;
	private $fileContents;

	const ACCEPTABLE_DELIMITERS = '~[#,;:|\t]~'; // acceptable delimiters
	//const EXCLUDED_CHARS = '~[a-zA-Z0-9.\r\n\f ]~'; // delimiters can't be characters, numbers or spaces

	// Constructor accepting a file path
	public function __construct($file)
	{
		$this->filePath = $file;
		// Read the file contents and store it into a private variable
		$this->fileContents = file($file);
	}

	public function getDelimiter()
	{
		$delimitersData = null;
		// Loop through each line in the file, identify the index as the line number and the content of the line as $line
		foreach ($this->fileContents as $lineNumber => $line)
		{
			// Don't parse an empty line, it could lead to weird results
			if (!empty($line))
			{
				$quoted = false;
				$delimitersForGivenLine = array();

				// Loop through each character in the line (including the last one, so a trailing delimiter
				// on a line without a line ending is not missed)
				for ($i = 0; $i < strlen($line); $i++)
				{
					// Read the character we are currently evaluating
					$char = substr($line, $i, 1);
					// If the character is a ", set $quoted to its opposite value 
					// (it starts out as false, so using !$quoted sets it to true, when it encounters another ", it will set it back to false and so on)
					if ($char === '"')
					{
						$quoted = !$quoted;
					}
					// Check if the character we are evaluating is an Acceptable Delimiter (or is not an Excluded Character)
					else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))
					//else if (!$quoted && !preg_match(self::EXCLUDED_CHARS, $char))
					{
						// Check if the character/delimiter was already found on this line and update its properties accordingly
						if (array_key_exists($char, $delimitersForGivenLine))
						{
							// Update the count for this delimiter since we just found another occurrence
							$delimitersForGivenLine[$char]['count']++;
							// Add the content of the line to this delimiter, so we know which delimiter to use on it later (this actually is useless -- I think)
							$delimitersForGivenLine[$char]['lines'][$lineNumber] = $line;
						}
						else
						{
							// This character/delimiter has not been found previously on this line, so create it
							$delimitersForGivenLine[$char]['count'] = 1;
							// Assign this delimiter the current line, so we know how to read that line later on
							$delimitersForGivenLine[$char]['lines'][$lineNumber] = $line;
						}
					}
				}

				// On the first line of the file, this variable will be null, now we need to set it. It will be used for comparing the delimiters of the previous line to the current line
				if ($delimitersData === null || empty($delimitersData))  
				{
					$delimitersData = $delimitersForGivenLine;
				}
				// Verify both the previous line's data and the current line's data have delimiters (otherwise the comparison isn't useful)
				else if (count($delimitersData) > 0 && count($delimitersForGivenLine) > 0)
				{
					// Store the current line's data into a new variable
					$newDelimitersByLine = $delimitersForGivenLine;
					// Loop through the previous lines delimiters (key is the delimiter character, and value is an array consisting of count and lines)
					foreach ($delimitersData as $key => $value)
					{
						// Verify the previous line's delimiter(s) exist in the current line's evaluation and if they do, verify the counts are the same
						// OR check that the previous line's delimiter(s) do not exist in the current line's evaluation
						// The point here is to see if we need to merge arrays
						// So why not use array_merge()? Good question, because it overwrites the keys of your arrays, and the keys are important to our system
						if ((array_key_exists($key, $delimitersForGivenLine) && $delimitersForGivenLine[$key]['count'] === $value['count'])
							|| !array_key_exists($key, $delimitersForGivenLine))
						{
							// This line is for when !array_key_exists($key, $delimitersForGivenLine) evaluates true, it writes the count into the 
							// new variable for the given delimiter (key)
							$newDelimitersByLine[$key]['count'] = $value['count'];

							// If the delimiter existed in the previous line, loop through the line numbers, keeping their index and values and 
							// copy them into the new variable.
							if (array_key_exists($key, $delimitersForGivenLine))
							{
								foreach ($value['lines'] as $lineNumber => $line)
									$newDelimitersByLine[$key]['lines'][$lineNumber] = $line;
							}
							else
							{
								// Since the delimiter didn't exist in the prior line, just write the lines directly over (we don't need to worry about keeping existing data)
								$newDelimitersByLine[$key]['lines'] = $value['lines'];
							}
						}
					}
					// Store the merged array so it can be used again for the next line (so it keeps a running count)
					$delimitersData = $newDelimitersByLine;
				}
			}
		}

		// Sort the array of delimiter data using a custom sort routine and maintaining the key indexes
		// This is to put the most frequent delimiter and its data at the top of the array
		uasort($delimitersData, "CSV::sortDelimiters");

		//Remove delimiters that don't have the exact count as the primary delimiter
		$initialCount = null;
		$finalDelimiterData = array();

		// Loop through each delimiter found in the file ($key is the delimiter character, and $data is the count/lines info)
		foreach ($delimitersData as $key => $data)
		{
			// Since the array is already sorted, we want to read the first delimiter and store it
			// All other delimiters will ONLY be stored if their count matches the first delimiter 
			// (so you can't have a delimiter of ";" that indicates it has 8 counts per line and have a delimiter of ","
			//that indicates it has 2 counts per line; the "," simply can't be an accurate delimiter in this case)
			if ($initialCount === null)
			{
				$initialCount = $data['count'];
				$finalDelimiterData[$key] = $data;
			}
			else
			{
				// Only store the delimiter if the count matches the most frequent found delimiter
				if ($initialCount === $data['count'])
					$finalDelimiterData[$key] = $data;
			}
		}

		// Return the delimiter information back, so it could be looped through and parsed using str_getcsv
		return $finalDelimiterData;
	}

        // Custom Sort for the Delimiters
        public static function sortDelimiters($a, $b)
        {
            // If the delimiter data for item $a in the array, matches item $b, return 0
            if ($a['count'] === $b['count'] && sizeof($a['lines']) === sizeof($b['lines']))
            {
                return 0;
            }

            // if $a has more lines associated to it than $b, return -1 so it leaves $a higher than $b,
            // otherwise, when $b needs to move up ahead of $a
            return sizeof($a['lines']) > sizeof($b['lines']) ? -1 : 1;
      } 
}

test.php

<?php
	include('csv.php');

	//$files = array('data.txt', 'comma.txt', 'colon.txt', 'pipe.txt', 'pound.txt', 'semicolon.txt', 'tab.txt', 'email.txt', 'mixture.txt');
	$files = array('data.txt', 'mixture.txt');
	foreach ($files as $file)
	{
		$csv = new CSV('files/' . $file);
		$delimiterData = $csv->getDelimiter();
		$delimiter = key($delimiterData);
		echo 'Delimiter for ' . $file . ' is ' . $delimiter . ' (' . ord($delimiter) . ')<br />';
		echo '<pre>';
		var_dump($delimiterData); // var_dump() prints directly and returns nothing, so no echo is needed
		echo '</pre><br />';
	}

data.txt

abc,111
def; 111
ijk; 222

output

Delimiter for data.txt is ; (59)
array(2) {
  [";"]=>
  array(2) {
    ["count"]=>
    int(1)
    ["lines"]=>
    array(2) {
      [2]=>
      string(8) "ijk; 222"
      [1]=>
      string(10) "def; 111
"
    }
  }
  [","]=>
  array(2) {
    ["count"]=>
    int(1)
    ["lines"]=>
    array(1) {
      [0]=>
      string(9) "abc,111
"
    }
  }
}

mixture.txt

this|is|"a test"|to|123|see|how|it|works
this; is; "a test"; to; 123; see; how; it; works
123.|can?|you&|see|what|I'm|doing?|eight*|nine

output

Delimiter for mixture.txt is | (124)
array(2) {
  ["|"]=>
  array(2) {
    ["count"]=>
    int(8)
    ["lines"]=>
    array(2) {
      [2]=>
      string(46) "123.|can?|you&|see|what|I'm|doing?|eight*|nine"
      [0]=>
      string(42) "this|is|"a test"|to|123|see|how|it|works
"
    }
  }
  [";"]=>
  array(2) {
    ["count"]=>
    int(8)
    ["lines"]=>
    array(1) {
      [1]=>
      string(50) "this; is; "a test"; to; 123; see; how; it; works
"
    }
  }
}

I’ve updated my prior post to return ALL delimiter data (as it may be helpful to your project). In short, it lets you know which lines are associated with each delimiter, so you can use str_getcsv to parse line by line with the delimiter determined for that line.
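
Parsing line by line from that data could look something like the sketch below. parseByDelimiter() is a hypothetical helper, not part of csv.php, and it assumes the array has exactly the shape shown in the var_dump output earlier (delimiter as key, then 'count' and 'lines').

```php
<?php
// Hypothetical helper: turn the array returned by getDelimiter() into parsed rows.
// Assumed shape: [delimiter => ['count' => int, 'lines' => [lineNumber => lineText]]]
function parseByDelimiter(array $delimiterData): array
{
    $rows = [];
    foreach ($delimiterData as $delimiter => $data) {
        foreach ($data['lines'] as $lineNumber => $line) {
            // rtrim() drops the line ending that file() keeps
            $clean = rtrim($line, "\r\n");
            $rows[$lineNumber] = str_getcsv($clean, (string) $delimiter, '"', '\\');
        }
    }
    ksort($rows); // restore the original line order
    return $rows;
}

// Sample input copied from the data.txt output
$sample = [
    ';' => ['count' => 1, 'lines' => [2 => 'ijk; 222', 1 => "def; 111\n"]],
    ',' => ['count' => 1, 'lines' => [0 => "abc,111\n"]],
];
print_r(parseByDelimiter($sample));
```

Each line is split with its own delimiter, so the comma line and the semicolon lines both come out as two fields.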

Thanks for all those.

Can you also please comment your code so that it will be easy for me to understand and to make any future modifications?

Many Thanks

Okay, I added a bunch of comments to my prior php code in Post #43.

I’ve also been playing with making it more OOP (if that is of any interest to you). I’m 95% there, but I really want to make an additional change to support associative keys in it that I haven’t quite figured out. If that is of any interest, I’ll post it as a zip file, as it contains several more files (same logic, just split up by responsibility).

Many thanks for your comments.

I was testing your code in post #43 with the following data, but it's returning a blank result. Can you please tell me what's wrong?


C:\\Users\\Fabien\\Desktop\\Combactive\\ACTIONS (Réunions et Courriers d'information pour adhérents)\\EMAGNY\\2012-2013\\Médias + Communication\\Francophone.txt 21/05/2013 22:35:14
Progitek [244 e-mails]
aaa.aaa@gmail.ch;
aaa.chat.enfant@gmail.fr;
aaa@aspas-gmail.org;
aaa@gmail.com;
aaa@asms-swiss.ch;
aaa@gmail.org;
aaa@gmail.be;
aaa.ellidge@gmail.fr;
aaa@gmail.fr;
aaa@gmail.be;
aaa.wahf@gmail.be;

Thanks

I copied and pasted the code straight from Post 43 and received the following output for your data

Delimiter for data.txt is ; (59)
array(1) {
  [";"]=>
  array(2) {
    ["count"]=>
    int(1)
    ["lines"]=>
    array(10) {
      [9]=>
      string(15) "aaa@gmail.be;
"
      [8]=>
      string(15) "aaa@gmail.fr;
"
      [7]=>
      string(23) "aaa.ellidge@gmail.fr;
"
      [6]=>
      string(15) "aaa@gmail.be;
"
      [5]=>
      string(16) "aaa@gmail.org;
"
      [4]=>
      string(20) "aaa@asms-swiss.ch;
"
      [3]=>
      string(16) "aaa@gmail.com;
"
      [2]=>
      string(22) "aaa@aspas-gmail.org;
"
      [1]=>
      string(27) "aaa.chat.enfant@gmail.fr;
"
      [0]=>
      string(19) "aaa.aaa@gmail.ch;
"
    }
  }
}

When I use the same code separately in another file, I get the following output. Please find the attached zip file.


Delimiter for fake-data-initially.txt is : (58)

array(1) {
  [":"]=>
  array(2) {
    ["count"]=>
    int(3)
    ["lines"]=>
    array(1) {
      [0]=>
      string(175) "C:\\Users\\Fabien\\Desktop\\Combactive\\ACTIONS (Réunions et Courriers d'information pour adhérents)\\EMAGNY\\2012-2013\\Médias + Communication\\Francophone.txt 21/05/2013 22:35:14
"
    }
  }
}


Your fake-data-initially.txt file is a bit funky…

The first line contains a file path, followed by what looks to be a metadata line. Neither of those helps in identifying a delimiter; the only useful lines are 3-13. Not sure how you would tell the system to ignore those two lines…
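
If the number of header lines is known in advance, one option is simply to drop them before the detection runs. This is only a sketch: readLinesSkippingHeader() is a hypothetical helper, and the CSV class would need a small change to accept the resulting array instead of calling file() itself.

```php
<?php
// Hypothetical helper: read a file but drop a fixed number of leading lines.
// The number of lines to skip is an assumption the caller must supply.
function readLinesSkippingHeader(string $path, int $skip): array
{
    $lines = file($path);
    if ($lines === false) {
        return []; // file could not be read
    }
    // The fourth argument (preserve_keys) keeps the original line numbers as array keys
    return array_slice($lines, $skip, null, true);
}
```

For the file above, a skip count of 2 would leave only the e-mail lines, still keyed by their original line numbers.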

OK, but if we look at the logic of the code, : occurs less often than ;

Hence the function should return ; as the delimiter.

Or have I completely misunderstood the logic of the getDelimiter function?

Thanks

Okay, try changing

if ($delimitersData === null)

To

if ($delimitersData === null || empty($delimitersData))

That made a difference for me.
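
One case where the extra check matters, as far as I can tell, is when the first line has no delimiters at all: the variable then holds an empty array rather than null, so a null-only check never seeds it again. The sketch below is a simplified stand-in for that seeding step only (seedFrom() is a hypothetical name, and the merge logic is left out on purpose).

```php
<?php
// Simplified stand-in for the seeding step in getDelimiter():
// each element of $perLine represents the delimiters found on one line.
function seedFrom(array $perLine, bool $useEmptyCheck)
{
    $delimitersData = null;
    foreach ($perLine as $found) {
        $needsSeed = $useEmptyCheck
            ? ($delimitersData === null || empty($delimitersData))
            : ($delimitersData === null);
        if ($needsSeed) {
            $delimitersData = $found;
        }
    }
    return $delimitersData;
}

$perLine = [[], [';' => 1]]; // the first line has no delimiters

var_dump(seedFrom($perLine, false)); // stays an empty array
var_dump(seedFrom($perLine, true));  // picks up the ';' data from the second line
```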

Oh, and you will need to update the sortDelimiters function to be

	// Custom Sort for the Delimiters
	public static function sortDelimiters($a, $b)
	{
		// If the delimiter data for item $a in the array, matches item $b, return 0
		if ($a['count'] === $b['count'] && sizeof($a['lines']) === sizeof($b['lines']))
		{
			return 0;
		}

		// if $a has more lines associated to it than $b, return -1 so it leaves $a higher than $b,
		// otherwise, when $b needs to move up ahead of $a
		return sizeof($a['lines']) > sizeof($b['lines']) ? -1 : 1;
	}

For the following data, can you please tell me why it's returning , (comma)?

abc@gmail.com;
def@gmail.com,
ijd@hotmail.com,
abc@gmail.com;

Apart from the question in post #54, can you also tell me why PHP puts extra lines between array elements when a file is read using the file() function, but does not add them with the fgets() function?

Array generated using fgets() function


Array
(
    [0] => C:\\Users\\Fabien\\Desktop\\Combactive\\ACTIONS (Réunions et Courriers d'information pour adhérents)\\EMAGNY\\2012-2013\\Médias + Communication\\Francophone.txt 21/05/2013 22:35:14
    [1] => Progitek [244 e-mails]
    [2] => aaa.aaa@gmail.ch;
    [3] => aaa.chat.enfant@gmail.fr;
    [4] => aaa@aspas-gmail.org;
    [5] => aaa@gmail.com;
    [6] => aaa@asms-swiss.ch;
    [7] => aaa@gmail.org;
    [8] => aaa@gmail.be;
    [9] => aaa.ellidge@gmail.fr;
    [10] => aaa@gmail.fr;
    [11] => aaa@gmail.be;
    [12] => aaa.wahf@gmail.be;
)

Array generated using file() function


Array
(
    [0] => C:\\Users\\Fabien\\Desktop\\Combactive\\ACTIONS (Réunions et Courriers d'information pour adhérents)\\EMAGNY\\2012-2013\\Médias + Communication\\Francophone.txt 21/05/2013 22:35:14

    [1] => Progitek [244 e-mails]

    [2] => aaa.aaa@gmail.ch;

    [3] => aaa.chat.enfant@gmail.fr;

    [4] => aaa@aspas-gmail.org;

    [5] => aaa@gmail.com;

    [6] => aaa@asms-swiss.ch;

    [7] => aaa@gmail.org;

    [8] => aaa@gmail.be;

    [9] => aaa.ellidge@gmail.fr;

    [10] => aaa@gmail.fr;

    [11] => aaa@gmail.be;

    [12] => aaa.wahf@gmail.be;

)

Here is the code for putting contents of file in array using fgets() function.


	$array = array();
	$counter = 0;
	while (!feof($fp)) {
		if ($counter == 0) {
			// First line: read it and trim it as-is
			$line = trim(fgets($fp, 4096));
		} else {
			$line = fgets($fp, 4096);
			// Convert to UTF-8 only when the line is not already valid UTF-8
			if (mb_detect_encoding($line, 'UTF-8', true)) {
				$line = trim($line);
			} else {
				$line = trim(utf8_encode($line));
			}
		}

		// Keep non-empty lines only
		if (strlen($line) != 0) {
			$array[] = $line;
		}
		$counter++;
	}
	fclose($fp);

Thanks

Because the comma and semi-colon are used equally. It doesn’t take a preference on first one encountered, so it just returns whichever one is at the top of the array (which happens to be comma).

If you use the delimiterData to parse your file, it won’t matter, as you can then parse the values that are comma delimited and parse the ones that are semi-colon delimited.

I’d assume that fgets is ignoring the \r, \n, or \f indicators, but file() is keeping them.

Edit: Never mind, it is because you are calling trim() on each line you read with fgets. That removes the trailing \r and \n that file() keeps.
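
If you want file() to give you lines without the line endings, it accepts a FILE_IGNORE_NEW_LINES flag. A minimal sketch, using a throwaway temp file as made-up sample data:

```php
<?php
// Write a small sample file so the example is self-contained.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "aaa@gmail.be;\nbbb@gmail.fr;\n");

$raw   = file($path);                        // each element keeps its trailing "\n"
$clean = file($path, FILE_IGNORE_NEW_LINES); // trailing line endings are dropped

var_dump($raw[0]);   // "aaa@gmail.be;" followed by a newline
var_dump($clean[0]); // "aaa@gmail.be;" with no newline

unlink($path);
```

That extra newline in each element is what print_r shows as a blank line between entries.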

Can you also please comment the code for this post of yours http://www.sitepoint.com/forums/showthread.php?1122995-Getting-delimiter-from-a-line&p=5511703&viewfull=1#post5511703

PS: Please don't forget to update the above-mentioned code with any changes it might require, for example the changes you made in post #52 (http://www.sitepoint.com/forums/showthread.php?1122995-Getting-delimiter-from-a-line&p=5512773&viewfull=1#post5512773)

Thanks

I’ve updated #43 (since it was the latest version) accordingly. I also added an edit to all other posts stating those were not the latest version of the code.

Hello again :)

I am currently using your code posted in post #43

One issue is with the following data, the getDelimiter function returns null


EMAIL;
abc@gmail.com; 
def@gmail.com;  
ghi@gmail.com;

But when I add some chars after semicolon it works like a charm


EMAIL; hey
abc@gmail.com; 
def@gmail.com;  
ghi@gmail.com;

It didn't resolve even when I added the trim() function.

Can you please let me know why is this happening?

Thanks

I don’t get that result, I get ‘;’, which is correct…
See attached code
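
For what it's worth, here is one guess at how an empty result could happen with that data: if each line is trimmed before it is scanned, and the scan stops at strlen($line) - 1, then a ; that is the very last character is never examined. That would also fit "EMAIL; hey" working, since the ; is no longer last. The sketch below only illustrates that idea; foundDelimiters() and its second parameter are hypothetical.

```php
<?php
// Hypothetical helper mirroring a character scan with a configurable end offset.
// $stopBeforeEnd = 1 means the final character of the line is never looked at.
function foundDelimiters(string $line, int $stopBeforeEnd): array
{
    $found = [];
    for ($i = 0; $i < strlen($line) - $stopBeforeEnd; $i++) {
        if (preg_match('~[#,;:|\t]~', $line[$i])) {
            $found[] = $line[$i];
        }
    }
    return $found;
}

var_dump(foundDelimiters("EMAIL;\n", 1));       // ';' is seen, only the newline is skipped
var_dump(foundDelimiters(trim("EMAIL;\n"), 1)); // nothing found, ';' is the skipped last character
var_dump(foundDelimiters(trim("EMAIL;\n"), 0)); // ';' is seen again
```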