Getting delimiter from a line

Hi

Can someone please tell me the best possible way to find the delimiter from a given line (not including the spaces)?

For convenience, we can use the email address as an anchor if necessary: the very next character (other than a space) after the email can be treated as the delimiter. But there may be cases where the email address comes last, so no delimiter follows it.

Some examples are:

Ex 1:


jon, doe, abc@gmail.com, 996655

Ex 2:


abc@gmail.com; doe; ;996655

Ex 3:


jon# doe# 996655# abc@gmail.com

Ex 4:


jon doe 96655

Ex 5:


jon doe 996655 abc@gmail.com

In examples 4 and 5 above, it should report that no delimiter was found.

Any help is appreciated.

Thanks

Something like this should do what you want:


$input = "jon# doe# 996655# abc@gmail.com";
$segments = explode(' ', $input);
$last_char = substr($segments[0], -1);

We split the string on each space and get an array of segments, then grab the last character from the first segment, which should give you your delimiter.

Edit: Just re-reading your OP, you’d also have to loop through all the elements except the last one to check if the values for $last_char are the same, otherwise the delimiter is missing.
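A minimal sketch of that check, assuming fields are always separated by at least one space and that letters and digits never act as delimiters. The function name guessDelimiterBySpaces() is hypothetical, made up for illustration:

```php
<?php
// Hypothetical helper: returns the delimiter, or null when none is found.
function guessDelimiterBySpaces(string $line): ?string
{
    // Split on runs of whitespace and drop empty pieces.
    $segments = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    if (count($segments) < 2) {
        return null;
    }

    // The last field has no trailing delimiter, so leave it out.
    array_pop($segments);

    $candidate = null;
    foreach ($segments as $segment) {
        $lastChar = substr($segment, -1);
        // Assumption: a letter or digit is data, not a delimiter.
        if (ctype_alnum($lastChar)) {
            return null;
        }
        if ($candidate === null) {
            $candidate = $lastChar;
        } elseif ($candidate !== $lastChar) {
            return null; // trailing characters disagree
        }
    }
    return $candidate;
}

var_dump(guessDelimiterBySpaces('jon# doe# 996655# abc@gmail.com')); // string(1) "#"
var_dump(guessDelimiterBySpaces('jon doe 996655 abc@gmail.com'));    // NULL
```

This still breaks as soon as the fields are not separated by spaces, or a field itself contains a space.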

That works well for examples 1-3, but 4 and 5 would fail that test. Also, if the spaces were just theoretical for showing you the components and not actually in the data that would fail too.

@phantom007 ; are there any assumptions we can make? Can we assume it will be a non-alphabetical/numerical character? Meaning primarily it should be a punctuation mark or special character?

Hi

Thanks for the reply.

I forgot to cover that there might not be any space between the fields. So in that case the above will not work :frowning:

Thanks

Hi, thanks for the reply.

No, because special characters can also appear inside the actual field values. For example, look at the following:


jon's, doe, 996655, abc@gmail.com

OR this one...


'my name is joe and my mob # is 2525' #  abc@gmail.com

The problem is that users from all over the world will be uploading CSVs with any delimiter in them, so we cannot be sure what they will upload. I am just thinking of a way to handle it.

Could you not either specify the delimiter that must be used, or prompt the user to tell you which they are using?

No, that would have been very easy to implement.

The challenge is to get the delimiter from the csv with some AI :wink:

So, as I said in my first post, the very next special character after the email (other than a space) can be used as the delimiter, or, if the email comes last, we can take the first special character. But I am not sure this is the proper way.
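A rough sketch of the email-anchor idea, under a few assumptions: the e-mail pattern is deliberately loose, letters and digits are never delimiters, and the function name guessDelimiterFromEmail() is hypothetical. Note one substitution: when the email comes last, this sketch takes the character immediately before the email rather than the first special character in the line, because the first special character can easily be part of a value (such as the apostrophe in jon's).

```php
<?php
// Hypothetical helper: returns the delimiter, or null when none is found.
function guessDelimiterFromEmail(string $line): ?string
{
    // Loose e-mail pattern (an assumption, not a full validator).
    $pattern = '/[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+/';
    if (!preg_match($pattern, $line, $m, PREG_OFFSET_CAPTURE)) {
        return null; // no email to anchor on
    }
    $start = $m[0][1];
    $end   = $start + strlen($m[0][0]);

    // Email is not last: use the first non-space character after it.
    $after = ltrim(substr($line, $end));
    if ($after !== '') {
        return ctype_alnum($after[0]) ? null : $after[0];
    }

    // Email is last: use the last non-space character before it.
    $before = rtrim(substr($line, 0, $start));
    if ($before === '') {
        return null;
    }
    $char = substr($before, -1);
    return ctype_alnum($char) ? null : $char;
}

var_dump(guessDelimiterFromEmail('jon, doe, abc@gmail.com, 996655')); // string(1) ","
var_dump(guessDelimiterFromEmail('jon# doe# 996655# abc@gmail.com')); // string(1) "#"
```

A line with no email at all returns null here, so this can only ever be one signal among several.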

Both of those examples still fit my question. Neither the , or # are alphabetical or numerical. They are punctuation/special characters (ie: ,.;:'"?{}/\|`~!@#$%^&*()=±_\ \s)

If we can assume the delimiter will be one of those, the process becomes a bit easier, but if we can’t safely assume that, then we have a problem. I just came across this through a search, which may be interesting:
http://www.codeproject.com/Articles/231582/Auto-detect-CSV-separator

This one also piqued my interest:

Ok so if we agree to that, how do we detect which one of those is our delimiter for the following case?

foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."

Thanks

The linked article would deduce that # is the delimiter, because it has a “quote” checker that ensures special characters inside quotes are not counted as delimiter candidates.
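To illustrate the quote check on that example line, here is a small sketch in PHP. The function name countDelimitersOutsideQuotes() and the default candidate list are assumptions for illustration, not code from the linked article:

```php
<?php
// Hypothetical helper: counts candidate delimiters that sit outside double quotes.
function countDelimitersOutsideQuotes(string $line, string $candidates = '#,;:|'): array
{
    $counts = [];
    $quoted = false;
    $length = strlen($line);

    for ($i = 0; $i < $length; $i++) {
        $char = $line[$i];
        if ($char === '"') {
            // An escaped quote ("") toggles twice, so the state stays correct.
            $quoted = !$quoted;
        } elseif (!$quoted && strpos($candidates, $char) !== false) {
            $counts[$char] = ($counts[$char] ?? 0) + 1;
        }
    }
    return $counts;
}

$line = 'foo.bar#example.com#"I like using ""#"" or ""."" as a CSV delimiter."';
print_r(countDelimitersOutsideQuotes($line)); // only the two unquoted # are counted
```

If '.' were added to the candidate list, it would also be counted twice outside the quotes on this line, so the quote check needs to be combined with a restricted candidate list or a comparison across several lines.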

Those are not in PHP :frowning:

Yes, I realize that, but the logic wasn’t too hard to follow. If I have time, I’ll try converting one to PHP later on (not sure if I’ll have the time though).

I’d really appreciate that.

Thanks

Okay, here is the setup (derived from http://www.powertheshell.com/autodetecting-csv-delimiter/):

Edit:

Please see post #43 for the most up-to-date version of this code.

project/
- files/
- - colon.txt
- - comma.txt
- - mixture.txt
- - pipe.txt
- - pound.txt
- - semicolon.txt
- csv.php
- test.php

The files:
colon.txt

this:is:"a test":to:123:see:how:it:works
this: is: "a test": to: 123: see: how: it: works
123.:can?:you&:see:what:I'm:doing?:eight*:nine

comma.txt

this,is,"a test",to,123,see,how,it,works
this, is, "a test", to, 123, see, how, it, works
123.,can?,you&,see,what,I'm,doing?,eight*,nine

mixture.txt

this|is|"a test"|to|123|see|how|it|works
this; is; "a test"; to; 123; see; how; it; works
123.|can?|you&|see|what|I'm|doing?|eight*|nine

pipe.txt

this|is|"a test"|to|123|see|how|it|works
this| is| "a test"| to| 123| see| how| it| works
123.|can?|you&|see|what|I'm|doing?|eight*|nine

pound.txt

this#is#"a test"#to#123#see#how#it#works
this# is# "a test"# to# 123# see# how# it# works
123.#can?#you&#see#what#I'm#doing?#eight*#nine

semicolon.txt

this;is;"a test";to;123;see;how;it;works
this; is; "a test"; to; 123; see; how; it; works
123.;can?;you&;see;what;I'm;doing?;eight*;nine

csv.php

<?php
class CSV
{
	private $filePath;
	private $fileContents;
	const ACCEPTABLE_DELIMITERS = '~[#,;:|]~'; // acceptable delimiters

	public function __construct($file)
	{
		$this->filePath = $file;
		$this->fileContents = file($file);
	}

	public function getDelimiter()
	{
		$delimitersByLine = array();
		foreach ($this->fileContents as $lineNumber => $line)
		{
			$quoted = false;
			$delimiters = array();

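			// Strip the line ending so the final character of a last line
			// without a trailing newline is not skipped
			$line = rtrim($line, "\r\n");
			$length = strlen($line);
			for ($i = 0; $i < $length; $i++)
			{
				$char = $line[$i];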
				if ($char === '"')
				{
					$quoted = !$quoted;
				}
				else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))
				{
					if (array_key_exists($char, $delimiters))
					{
						$delimiters[$char]++;
					}
					else
					{
						$delimiters[$char] = 1;
					}
				}
			}

			if (empty($delimitersByLine))
			{
				$delimitersByLine = $delimiters;
			}
			else
			{
				$newDelimitersByLine = $delimiters;
				foreach ($delimitersByLine as $key => $value)
				{
					if ((array_key_exists($key, $delimiters) && $delimiters[$key] === $value)
						|| !array_key_exists($key, $delimiters))
					{
						$newDelimitersByLine[$key] = $value;
					}
				}
				$delimitersByLine = $newDelimitersByLine;

				if (sizeof($delimitersByLine) < 2)
					break;
			}
		}

		arsort($delimitersByLine);
		$firstDelimiter = key($delimitersByLine);

		if (sizeof($delimitersByLine) > 1)
		{
			next($delimitersByLine);
			$nextDelimiter = key($delimitersByLine);
			if ($delimitersByLine[$firstDelimiter] === $delimitersByLine[$nextDelimiter])
			{
				// multiple delimiters with the same frequency found
				// throw an error
				throw new UnexpectedValueException();
			}

			return $firstDelimiter;
		}
		else
			return $firstDelimiter;
	}
}

test.php

<?php
	include('csv.php');

	$comma = new CSV('files/comma.txt');
	echo 'Delimiter for comma.txt is ' . $comma->getDelimiter() . '<br />';

	$colon = new CSV('files/colon.txt');
	echo 'Delimiter for colon.txt is ' . $colon->getDelimiter() . '<br />';

	$pipe = new CSV('files/pipe.txt');
	echo 'Delimiter for pipe.txt is ' . $pipe->getDelimiter() . '<br />';

	$pound = new CSV('files/pound.txt');
	echo 'Delimiter for pound.txt is ' . $pound->getDelimiter() . '<br />';

	$semicolon = new CSV('files/semicolon.txt');
	echo 'Delimiter for semicolon.txt is ' . $semicolon->getDelimiter() . '<br />';

	$mixture = new CSV('files/mixture.txt');
	echo 'Delimiter for mixture.txt is ' . $mixture->getDelimiter() . '<br />';

The Output:

Delimiter for comma.txt is ,
Delimiter for colon.txt is :
Delimiter for pipe.txt is |
Delimiter for pound.txt is #
Delimiter for semicolon.txt is ;

Fatal error: Uncaught exception 'UnexpectedValueException' in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php:75 Stack trace: #0 M:\SVN\sitepoint\trunk\Sitepoint\cancer10\test.php(20): CSV->getDelimiter() #1 {main} thrown in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php on line 75


Hi cpradio

Thanks for your efforts.

Does it also support tabs?

Thanks

Okay, I did find a small issue with my initial code (so I’ve updated it). It should support any type of delimiter you can think of; you simply have to alter the following line to include \t for tab:

const ACCEPTABLE_DELIMITERS = '~[#,;:|]~'; // acceptable delimiters

Example:
tab.txt

this	is	"a test"	to	123	see	how	it	works
this	 is	 "a test"	 to	 123	 see	 how	 it	 works
123.	can?	you&	see	what	I'm	doing?	eight*	nine

Updated ACCEPTABLE_DELIMITERS

const ACCEPTABLE_DELIMITERS = '~[#,;:|\t]~'; // acceptable delimiters

Then update the test.php file to include:

	$tab = new CSV('files/tab.txt');
	echo 'Delimiter for tab.txt is ' . $tab->getDelimiter() . '<br />';

Output (note tab.txt shows empty because you can’t visibly see a tab character):

Delimiter for comma.txt is ,
Delimiter for colon.txt is :
Delimiter for pipe.txt is |
Delimiter for pound.txt is #
Delimiter for semicolon.txt is ;
Delimiter for tab.txt is 

Fatal error: Uncaught exception 'UnexpectedValueException' in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php:75 Stack trace: #0 M:\SVN\sitepoint\trunk\Sitepoint\cancer10\test.php(23): CSV->getDelimiter() #1 {main} thrown in M:\SVN\sitepoint\trunk\Sitepoint\cancer10\csv.php on line 75

Edit:

Added tab instructions/test

Here is another neat thing you could do if you don’t want to define a range of acceptable delimiters: define a range of characters that can’t be delimiters instead.

Just change this line in csv.php

const ACCEPTABLE_DELIMITERS = '~[#,;:|\t]~'; // acceptable delimiters

to:

const EXCLUDED_CHARS = '~[a-zA-Z0-9 ]~'; // delimiters can't be letters, digits or spaces

And change this line

else if (!$quoted && preg_match(self::ACCEPTABLE_DELIMITERS, $char))

to:

else if (!$quoted && !preg_match(self::EXCLUDED_CHARS, $char))

Then everything except a-z, A-Z, 0-9, and spaces can be a delimiter.

Edit:

Updated so tabs work in the EXCLUDED_CHARS version

Thanks again cpradio for your inputs.

Is it mandatory to define the allowed chars within the square brackets?

Because I see you putting all chars inside ~~

Secondly, why is there a “Fatal error: Uncaught exception” in the output of your post #17?

Thanks

You can define ACCEPTABLE_DELIMITERS or change it to EXCLUDED_CHARS per post #18. EXCLUDED_CHARS allows you to define which characters can’t be delimiters. Think a-z and 0-9 along with spaces (you might want to add " and ' in there as well).
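On the square brackets question: in these patterns the two ~ characters are only the regex delimiters that preg_match() requires around a pattern, and the [...] inside is a character class meaning "any one of these characters". A small sketch to show the difference between the two constants (the variable names here are just for illustration):

```php
<?php
// '~' marks the start and end of the pattern; '[...]' is a character class.
$acceptable = '~[#,;:|\t]~';     // matches one of: # , ; : | or a tab
$excluded   = '~[a-zA-Z0-9 ]~';  // matches one letter, digit or space

var_dump(preg_match($acceptable, ';'));  // int(1)
var_dump(preg_match($acceptable, "\t")); // int(1)
var_dump(preg_match($acceptable, 'a'));  // int(0)

var_dump(preg_match($excluded, 'a'));    // int(1), so 'a' can't be a delimiter
var_dump(preg_match($excluded, '#'));    // int(0), so '#' is still a candidate
```

Without the brackets, '~#,;:|~' would be read as alternatives separated by |, not as a set of single characters, so the brackets are needed for this use.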

The fatal error comes from mixture.txt: it has two possible delimiters (| and ;) that each occur 8 times on a line, so the system can’t tell which one should be used in that case (so I have it throw an exception).
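If a fatal error is not what you want, the caller can catch the exception and handle the ambiguous file itself. A sketch of the idea, using a hypothetical stand-in function pickDelimiter() that mirrors only the tie check (it is not the CSV class above, and the counts passed in are made up to resemble mixture.txt):

```php
<?php
// Hypothetical stand-in for the tie check: takes delimiter => count pairs.
function pickDelimiter(array $counts): string
{
    if ($counts === []) {
        throw new UnexpectedValueException('No delimiter candidates found.');
    }

    arsort($counts); // highest count first, keys preserved
    $values = array_values($counts);

    if (count($values) > 1 && $values[0] === $values[1]) {
        // Two candidates share the top frequency, so the choice is ambiguous.
        throw new UnexpectedValueException('Ambiguous delimiter.');
    }
    return (string) array_keys($counts)[0];
}

try {
    echo 'Delimiter is ' . pickDelimiter(['|' => 8, ';' => 8]);
} catch (UnexpectedValueException $e) {
    // Fall back to asking the user, skipping the file, and so on.
    echo 'Could not detect delimiter: ' . $e->getMessage();
}
```

The same try/catch could be wrapped around each $file->getDelimiter() call in test.php.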