Getting delimiter from a line

While the “EXCLUDED_CHARS” concept makes more sense but it does not work perfect for a particular case.

Consider the CSV file has only these 2 lines:

EMAIL
test@gmailcom

It shows the delimiter as @ which is wrong :frowning:

THanks

We’ll you don’t have a lot of test data there, what did you expect it to come up with if you don’t exclude @? There is no delimiter in your example…

If the CSV does not have any delimiter then it should return empty value.

And it might do that, if you exclude @

In fact, it does. It will return NUL or a value of chr(0). Which you can see by outputting ord($email->getDelimiter()); of which I used the following exclusions:

const EXCLUDED_CHARS = '~[a-zA-Z0-9 @]~'; // delimiters can't be characters, numbers or spaces

Keep in mind the goal here is to find an UNKNOWN delimiter. Unless you give it some indication of the type of characters to seek, it will make a delimiter out of ANYTHING.

Last but not least, I came up with one more check that may help in the e-mail test you have, but I don’t promise it won’t affect something else (although, it didn’t affect any of the tests I wrote up).

Edit:

Please see post #43 for the most up-to-date version of this code.

<?php
class CSV
{
	private $filePath;
	private $fileContents;
	//const ACCEPTABLE_DELIMITERS = '~[#,;:|\	]~'; // acceptable delimiters
	const EXCLUDED_CHARS = '~[a-zA-Z0-9.\\r\
\\f ]~'; // delimiters can't be characters, numbers or spaces

	public function __construct($file)
	{
		$this->filePath = $file;
		$this->fileContents = file($file);
	}

	public function getDelimiter()
	{
		$delimitersByLine = null;
		foreach ($this->fileContents as $lineNumber => $line)
		{
			$quoted = false;
			$delimiters = array();

			for ($i = 0; $i < strlen($line) - 1; $i++)
			{
				$char = substr($line, $i, 1);
				if ($char === '"')
				{
					$quoted = !$quoted;
				}
				//else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))
				else if (!$quoted && !preg_match(self::EXCLUDED_CHARS, $char))
				{
					if (array_key_exists($char, $delimiters))
					{
						$delimiters[$char]++;
					}
					else
					{
						$delimiters[$char] = 1;
					}
				}
			}

			if ($delimitersByLine === null)
			{
				$delimitersByLine = $delimiters;
			}
			else if (count($delimitersByLine) > 0 && count($delimiters) > 0)
			{
				$newDelimitersByLine = $delimiters;
				foreach ($delimitersByLine as $key => $value)
				{
					if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
						|| !array_key_exists($key, $delimiters))
					{
						$newDelimitersByLine[$key] = $value;
					}
				}
				$delimitersByLine = $newDelimitersByLine;

				if (sizeof($delimitersByLine) < 2)
					break;
			}
		}

		arsort($delimitersByLine);
		$firstDelimiter = key($delimitersByLine);

		if (sizeof($delimitersByLine) > 1)
		{
			next($delimitersByLine);
			$nextDelimiter = key($delimitersByLine);
			if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
			{
				// multiple delimiters with the same frequency found
				// throw an error
				throw new UnexpectedValueException();
			}

			return $firstDelimiter;
		}
		else
			return $firstDelimiter;
	}
}

This is the part I changed


			if ($delimitersByLine === null)
			{
				$delimitersByLine = $delimiters;
			}
			else if (count($delimitersByLine) > 0 && count($delimiters) > 0)

My new test file (which now writes out the ord() value of the delimiter too)

<?php
	include('csv.php');

	$files = array('comma.txt', 'colon.txt', 'pipe.txt', 'pound.txt', 'semicolon.txt', 'tab.txt', 'email.txt', 'mixture.txt');
	foreach ($files as $file)
	{
		$csv = new CSV('files/' . $file);
		$delimiter = $csv->getDelimiter();
		echo 'Delimiter for ' . $file . ' is ' . $delimiter . ' (' . ord($delimiter) . ')<br />';
	}

By setting $delimitersByLine to null at the beginning, and verifying it is still null (so it only gets set from the first line) resolves an issue with zero delimiters being found on the first line. So now when it finds zero delimiters on the first line, it will store an empty array in $delimitersByLine and not overwrite it with line 2 (if it contains delimiters).

The else if then verifies that the line prior to the one being analyzed contained at least one delimiter and the current line contains at least one delimiter. If that is true, then it looks at the delimiters and combines them accordingly into a new array for tallying and continues to move forward.

Now for the e-mail, it produces the following output (the parenthesizes shows the ord() value of the delimiter found).

Delimiter for comma.txt is , (44)
Delimiter for colon.txt is : (58)
Delimiter for pipe.txt is | (124)
Delimiter for pound.txt is # (35)
Delimiter for semicolon.txt is ; (59)
Delimiter for tab.txt is (9)
Delimiter for email.txt is (0)

Fatal error: Uncaught exception 'UnexpectedValueException' in M:\\SVN\\sitepoint\	runk\\Sitepoint\\cancer10\\csv.php:77 Stack trace: #0 M:\\SVN\\sitepoint\	runk\\Sitepoint\\cancer10\	est.php(8): CSV->getDelimiter() #1 {main} thrown in M:\\SVN\\sitepoint\	runk\\Sitepoint\\cancer10\\csv.php on line 77

Hi cpradio

I have been testing ur script with some radom csv files and came with one that generated an exception error and since I am unable to understand why it occurred I would like to seek ur help in finding out why.

I have PMed u the csv file URL, pls check ur inbox.

FYI I am using the following code of urs



    function getDelimiter($file)
    {
        $delimitersByLine = null;
        $excluded = '~[a-zA-Z0-9.\\r\
\\f ]~';

        foreach ($file as $lineNumber => $line)
        {
            $quoted = false;
            $delimiters = array();


            for ($i = 0; $i < strlen($line) - 1; $i++)
            {
                $char = substr($line, $i, 1);
                if ($char === '"')
                {
                    $quoted = !$quoted;
                }

                else if (!$quoted && !preg_match($excluded, $char))
                {
                    if (array_key_exists($char, $delimiters))
                    {
                        $delimiters[$char]++;
                    }
                    else
                    {
                        $delimiters[$char] = 1;
                    }
                }
            }

            if ($delimitersByLine === null)
            {
                $delimitersByLine = $delimiters;
            }
            else if (count($delimitersByLine) > 0 && count($delimiters) > 0)
            {
                $newDelimitersByLine = $delimiters;
                foreach ($delimitersByLine as $key => $value)
                {
                    if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
                        || !array_key_exists($key, $delimiters))
                    {
                        $newDelimitersByLine[$key] = $value;
                    }
                }
                $delimitersByLine = $newDelimitersByLine;

                if (sizeof($delimitersByLine) < 2)
                    break;
            }
        }

        arsort($delimitersByLine);
        $firstDelimiter = key($delimitersByLine);

        if (sizeof($delimitersByLine) > 1)
        {
            next($delimitersByLine);
            $nextDelimiter = key($delimitersByLine);
            if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
            {
                // multiple delimiters with the same frequency found
                // throw an error
                throw new UnexpectedValueException();
            }

            return $firstDelimiter;
        }
        else
            return $firstDelimiter;
    }

$fileArray = file('test.csv');
$delimiter = getDelimiter($fileArray);
echo $delimiter;

Error:

Fatal error: Uncaught exception ‘UnexpectedValueException’ in /var/www/php2csv/index.php:140 Stack trace: #0 /var/www/php2csv/index.php(184): getDelimiter(Array) #1 {main} thrown in /var/www/php2csv/index.php on line 140

Put var_dump($delimitersByLine); right before

                throw new UnexpectedValueException();

Then give me the output generated from that statement.

Hi

This is the output.

array(25) { [“-”]=> int(3) [“;”]=> int(3) [“>”]=> int(2) [“<”]=> int(2) [“^”]=> int(1) [" “]=> int(1) [“”]=> int(1) [“©”]=> int(1) [“¢”]=> int(1) [“¯”]=> int(1) [”¡“]=> int(1) [“Å”]=> int(1) [“ƒ”]=> int(1) [”§“]=> int(1) [”/“]=> int(1) [”=“]=> int(1) [”_“]=> int(1) [”:“]=> int(1) [”@“]=> int(1) [”'“]=> int(1) [”%“]=> int(1) [”+“]=> int(1) [”)“]=> int(1) [”("]=> int(1) [“Ô]=> int(1) } -

Okay, it sees two possible delimiters, one with - and one with ;, so you can either exclude “-” and it will work.

$excluded = '~[a-zA-Z0-9.\\r\
\\f\\- ]~';

Hi

If I use

const ACCEPTABLE_DELIMITERS = '~[#,;:|]~'; 

What do I have to add for it to accept TABs?

Thanks

\

const ACCEPTABLE_DELIMITERS = '~[#,;:|\	]~';

Great…working except for the case where u have empty line feeds in the first few line…its displaying blank result in delimiter.

Any solutions?

please check attachment for example.

Many thanks

I call that bad data, I guess you could just add a !empty($line) in the foreach

Edit:

Please see post #43 for the most up-to-date version of this code.

    function getDelimiter($file)
    {
        $delimitersByLine = null;
        $excluded = '~[a-zA-Z0-9.\\r\
\\f ]~';

        foreach ($file as $lineNumber => $line)
        {
            if (!empty($line))
            {
                $quoted = false;
                $delimiters = array();
            
            
                for ($i = 0; $i < strlen($line) - 1; $i++)
                {
                    $char = substr($line, $i, 1);
                    if ($char === '"')
                    {
                        $quoted = !$quoted;
                    }
                
                    else if (!$quoted && !preg_match($excluded, $char))
                    {
                        if (array_key_exists($char, $delimiters))
                        {
                            $delimiters[$char]++;
                        }
                        else
                        {
                            $delimiters[$char] = 1;
                        }
                    }
                }

                if ($delimitersByLine === null)
                {
                    $delimitersByLine = $delimiters;
                }
                else if (count($delimitersByLine) > 0 && count($delimiters) > 0)
                {
                    $newDelimitersByLine = $delimiters;
                    foreach ($delimitersByLine as $key => $value)
                    {
                        if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
                            || !array_key_exists($key, $delimiters))
                        {
                            $newDelimitersByLine[$key] = $value;
                        }
                    }
                    $delimitersByLine = $newDelimitersByLine;

                    if (sizeof($delimitersByLine) < 2)
                        break;
                }
            }
        }

        arsort($delimitersByLine);
        $firstDelimiter = key($delimitersByLine);

        if (sizeof($delimitersByLine) > 1)
        {
            next($delimitersByLine);
            $nextDelimiter = key($delimitersByLine);
            if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
            {
                // multiple delimiters with the same frequency found
                // throw an error
                throw new UnexpectedValueException();
            }

            return $firstDelimiter;
        }
        else
            return $firstDelimiter;
    }

$fileArray = file('test.csv');
$delimiter = getDelimiter($fileArray);
echo $delimiter;

yup worked like a charm.

Another set of data that is not returning a delimiter:

C:\Users\Fabien\Desktop\Combactive\ACTIONS (Réunions et Courriers d’information pour adhérents)\EMAGNY\2012-2013\Médias + Communication\Francophone.txt 21/05/2013 22:35:14
Progitek [3 e-mails]
mat.tierschutz@bluewin.co;
cat.chat.enfant@gmail.com;
bay@aspas-nature.in;
dd.wahf@gmail.be;
aa@gmail.com;
bb3a@gmail.com;
cc4a@gmail.fr;

The delimiter in this case should be ; without any second thought

Thanks

Well, that is probably because I accidentally switched it back to using $excluded characters instead of a range of acceptable delimiters. If you put back in the acceptable delimiters you want to track, then it may work. It is catching the @ and the ; as being possible delimiters with the latest update I posted.

I am already using the acceptable chars instead of excluded chars, pls chk code below.


<?php
 function getDelimiter($file)
    {
        $delimitersByLine = null;
        $delimiterArray = array();
        // Exclude the following decimal chars.
        $excluded = array(10,11,12,13);
        $included = '~[#,;|\	]~';

        foreach ($file as $lineNumber => $line)
        {

            if( in_array(ord($line), $excluded)) continue;



            $quoted = false;
            $delimiters = array();

            for ($i = 0; $i < strlen($line) - 1; $i++)
            {
                $char = substr($line, $i, 1);

                if ($char === '"')
                {
                    $quoted = !$quoted;
                }



                else if (!$quoted && preg_match($included, $char))
                {
                    if (array_key_exists($char, $delimiters))
                    {
                        $delimiters[$char]++;
                    }
                    else
                    {
                        $delimiters[$char] = 1;
                    }
                }
            }

            if ($delimitersByLine === null)
            {
                $delimitersByLine = $delimiters;
            }
            else if (count($delimitersByLine) > 0 && count($delimiters) > 0)
            {
                $newDelimitersByLine = $delimiters;
                foreach ($delimitersByLine as $key => $value)
                {
                    if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
                        || !array_key_exists($key, $delimiters))
                    {
                        $newDelimitersByLine[$key] = $value;
                    }
                }
                $delimitersByLine = $newDelimitersByLine;

                if (sizeof($delimitersByLine) < 2)
                    break;
            }
        }

        arsort($delimitersByLine);
        $firstDelimiter = key($delimitersByLine);

        if (sizeof($delimitersByLine) > 1)
        {
            next($delimitersByLine);
            $nextDelimiter = key($delimitersByLine);
            if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
            {
                // multiple delimiters with the same frequency found
                // throw an error
                var_dump($delimitersByLine);
                //throw new UnexpectedValueException();
            }

            $delimiter = $firstDelimiter;
        }
        else
            $delimiter = $firstDelimiter;

            //ed('['.ord($delimiter).']');
        $delimiterArray = array(
                'DELIMITER' => $delimiter,
                'DELIMITER_DESC' => $delimiter
            );

        // Check delimiters
        if( ! $delimiter ) $delimiterArray['DELIMITER_DESC'] = 'NO_DELIMITER_FOUND';
        if( ord($delimiter)==9 ) $delimiterArray['DELIMITER_DESC'] = 'TAB';



        return $delimiterArray;
    }

?>

No, idea what you may have done to your code that breaks it, but when I put it back into the original code, it correctly identifies ; as the delimiter

Edit:

Please see post #43 for the most up-to-date version of this code.

csv.php

<?php
class CSV
{
	private $filePath;
	private $fileContents;
	const ACCEPTABLE_DELIMITERS = '~[#,;|\	]~'; // acceptable delimiters
	//const EXCLUDED_CHARS = '~[a-zA-Z0-9.\\r\
\\f ]~'; // delimiters can't be characters, numbers or spaces

	public function __construct($file)
	{
		$this->filePath = $file;
		$this->fileContents = file($file);
	}

	public function getDelimiter()
	{
		$delimitersByLine = null;
		foreach ($this->fileContents as $lineNumber => $line)
		{
			if (!empty($line))
			{
				$quoted = false;
				$delimiters = array();

				for ($i = 0; $i < strlen($line) - 1; $i++)
				{
					$char = substr($line, $i, 1);
					if ($char === '"')
					{
						$quoted = !$quoted;
					}
					else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))
					//else if (!$quoted && !preg_match(self::EXCLUDED_CHARS, $char))
					{
						if (array_key_exists($char, $delimiters))
						{
							$delimiters[$char]++;
						}
						else
						{
							$delimiters[$char] = 1;
						}
					}
				}

				if ($delimitersByLine === null)
				{
					$delimitersByLine = $delimiters;
				}
				else if (count($delimitersByLine) > 0 && count($delimiters) > 0)
				{
					$newDelimitersByLine = $delimiters;
					foreach ($delimitersByLine as $key => $value)
					{
						if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
							|| !array_key_exists($key, $delimiters))
						{
							$newDelimitersByLine[$key] = $value;
						}
					}
					$delimitersByLine = $newDelimitersByLine;

					if (sizeof($delimitersByLine) < 2)
						break;
				}
			}
		}

		arsort($delimitersByLine);
		$firstDelimiter = key($delimitersByLine);

		if (sizeof($delimitersByLine) > 1)
		{
			next($delimitersByLine);
			$nextDelimiter = key($delimitersByLine);
			if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
			{
				// multiple delimiters with the same frequency found
				// throw an error
				throw new UnexpectedValueException();
			}

			return $firstDelimiter;
		}
		else
			return $firstDelimiter;
	}
}

test.php

<?php
	include('csv.php');

	$files = array('data.txt', 'comma.txt', 'colon.txt', 'pipe.txt', 'pound.txt', 'semicolon.txt', 'tab.txt', 'email.txt', 'mixture.txt');
	foreach ($files as $file)
	{
		$csv = new CSV('files/' . $file);
		$delimiter = $csv->getDelimiter();
		echo 'Delimiter for ' . $file . ' is ' . $delimiter . ' (' . ord($delimiter) . ')<br />';
	}

data.txt

mat.tierschutz@bluewin.co;
cat.chat.enfant@gmail.com;
bay@aspas-nature.in;
dd.wahf@gmail.be;
aa@gmail.com;
bb3a@gmail.com;
cc4a@gmail.fr;

output

Delimiter for data.txt is ; (59)
Delimiter for comma.txt is , (44)
Delimiter for colon.txt is (0)
Delimiter for pipe.txt is | (124)
Delimiter for pound.txt is # (35)
Delimiter for semicolon.txt is ; (59)
Delimiter for tab.txt is (9)
Delimiter for email.txt is (0)

Fatal error: Uncaught exception 'UnexpectedValueException' in M:\\SVN\\sitepoint\	runk\\Sitepoint\\cancer10\\csv.php:80 Stack trace: #0 M:\\SVN\\sitepoint\	runk\\Sitepoint\\cancer10\	est.php(8): CSV->getDelimiter() #1 {main} thrown in M:\\SVN\\sitepoint\	runk\\Sitepoint\\cancer10\\csv.php on line 80

Yes you are right, that code works.

I was thinking a way to avoid an exception. Say when an exception occurs can we check the count of which delimiter is more? For example, consider the following example. In this case an exception will occur for , and ;

So as per my above logic, we consider the ; as the delimiter since the count of ; is more than ,

And if the count of , is equal to ; then we consider which ever it finds first. Do u think this is a neat idea?

abc,111
def; 111
ijk; 222

Thanks

That doesn’t really help you parse the data. fgetcsv, only accepts 1 delimiter. You will end up with either an error or bad data. Neither of which is helpful to your application.