Choose whole sentences and ONLY whole sentences RELIABLY with regex

I have found a way to select ANY whole English sentence reliably, regardless of quotation marks or even punctuation marks used inside it for abbreviations, decimals or whatever other purposes! It tests reliably on any non-accented string!

[“‘]?[A-Z][^.?!]+((?![.?!][’”]?\s[“‘]?[A-Z][^.?!]).)+[.?!’”]+

EXPLANATION:

In human language it reads as follows:

Find a non-accented capital letter that may be preceded by a quotation mark, and check that it is not directly followed by a punctuation mark, to exclude capital-letter abbreviations inside sentences. Then crawl forward by repeating a group consisting of a negative look-ahead and the universal selector character (the dot) until you arrive at the end of the sentence you are in. You will know you are there when you find the sentence-closing punctuation mark, a possible quotation mark (the one closing its pair at the start of your sentence) and the white space that necessarily separates your sentence from the next one. Then you repeat the criteria for the start of a sentence to see that it’s already a new one! Because of the negative condition in the look-ahead, the repeated group (the universal selector, really) did not choose the closing punctuation mark and the possible quotation mark, so you have to take care of these separately.
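
To make the explanation above easier to follow, here is a small sketch that assembles the same expression from its parts and runs it on a sample text. The variable names and the sample text are only illustrative assumptions, not part of the original expression.

```javascript
// The expression from above, assembled from named parts (names are illustrative only).
const opener = "[“‘]?";                                // optional opening quotation mark
const start = "[A-Z][^.?!]+";                          // capital letter, then at least one non-terminator
const nextSentence = "[.?!][’”]?\\s[“‘]?[A-Z][^.?!]";  // what a sentence boundary looks like
const body = "((?!" + nextSentence + ").)+";           // take one character at a time until a boundary is next
const closer = "[.?!’”]+";                             // closing punctuation and quotation marks

const sentenceRx = new RegExp(opener + start + body + closer, "g");

const sample = "It rained at 5.30 today. “We stayed in!” Nobody minded.";
const sentences = sample.match(sentenceRx);
console.log(sentences);
// [ "It rained at 5.30 today.", "“We stayed in!”", "Nobody minded." ]
```

Note that the decimal point in 5.30 is not treated as a sentence end, because it is not followed by white space and a capital letter.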

SUGGESTION FOR FURTHER DEVELOPMENT:
Together with the starting non-accented capital letters, you can also use hexadecimal notation to describe accented ANSI capital letters and so select sentences in other European languages. But this is not an issue for me at present…
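
A minimal sketch of what that suggestion could look like, assuming the Latin-1 range is enough. The names `latin1Capital` and `anyCapital` are made up for illustration, and the `\p{Lu}` alternative needs an engine that supports Unicode property escapes (the `u` flag).

```javascript
// Latin-1 accented capitals run from U+00C0 to U+00DE, except U+00D7 (the multiplication sign).
const latin1Capital = /[A-Z\u00C0-\u00D6\u00D8-\u00DE]/;
// With the "u" flag, \p{Lu} matches any Unicode uppercase letter.
const anyCapital = /\p{Lu}/u;

console.log(latin1Capital.test("\u00C9lan")); // true:  É is U+00C9
console.log(latin1Capital.test("\u00E9lan")); // false: é is lowercase
console.log(latin1Capital.test("\u0150sz"));  // false: Ő (U+0150) is outside Latin-1
console.log(anyCapital.test("\u0150sz"));     // true
```

Either character class could stand in for `[A-Z]` in the expression, at the start of the sentence and inside the look-ahead.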

<script type="text/javascript">

var exp = /["']?[A-Z][^.?!]+((?![.?!]['"]?\s["']?[A-Z][^.?!]).)+[.?!'"]+/;

alert( "'I!'".search( exp ) );

alert( "T.H.E.M. is an abbreviation!".search( exp ) );

</script>

Keep trying. Your expression removes leading abbreviations and splits sentences after them:

var rx = /[“‘]?[A-Z][^.?!]+((?![.?!][’”]?\s[“‘]?[A-Z][^.?!]).)+[.?!’”]+/g,
    str = 'Mr. Sherlock Holmes and Dr. John Watson were better than the F.B.I. at crime fighting';

str.match(rx).join('\n\n')

/* returned value: (String)
Sherlock Holmes and Dr.

John Watson were better than the F.B.I.
*/

First of all, thanks for your reply. This is how valuable help can point out the shortcomings in someone’s basic suppositions. Mine was the way I defined a sentence as:
“It starts with a capital letter never followed by any punctuation mark, probably preceded by a quotation mark. Also, the punctuation mark - space - word starting capital letter sequence ONLY occurs at the very end of a sentence.”

Yes, you are right. You can start a sentence with an abbreviation, like:

Dr. Sherlock Holmes and Dr. John Watson were better than the F.B.I. c.btd. squad at 5.14(?) at crime fighting
or: “T.H.E.M. is an abbreviation!”

  1. Not insisting on a non-punctuation mark character after the starting capital letter and making the selection in the negative look-ahead group to be possibly zero handles this issue!

    [“‘]?A-Z*[.?!’”]+

    Now Dr. will be selected at the beginning.
  2. The regex DOES recognize a sentence if it IS properly closed with a punctuation mark!

    So.
    “T.H.E.M. is an abbreviation” - is NOT matched.
    “T.H.E.M. is an abbreviation!” - IS matched.

    The same with:
    Sherlock Holmes and John Watson were better than the F.B.I. at 5.30(!?) p.m. at crime fighting - is NOT recognized, only after you close it properly.
    (Please, notice the lack of Dr. this time!)

    Normally sentences ARE and SHOULD BE closed, shouldn’t they? So, this is a mistake of the user and not the regex and it should be handled differently, possibly through a site policy.
  3. The third IS a serious issue that I can’t solve!

    You are right, if a punctuation mark is followed by space and a word starting with a capital letter I CAN’T TELL if I’m starting a new sentence, OR I am still inside the same sentence after an abbreviation!

    The regex doesn’t care about meaning or grammar, so the following are ALL well-formed sentences from its point of view!

    Mr.
    Sherlock Holmes and Dr.
    John Watson were better than the F.B.I. at crime fighting.

Possibly, the best bet would be to assume a maximum length for abbreviations and allow them to occur inside a sentence, but this is where, perhaps, you could come in to help!
The question is: CAN YOU GIVE AN OVERALL DEFINITION FOR A SENTENCE?

This is the best approximation I could come up with to select sentences with abbreviations in the inside.

[“'“]?(A-Z)(((Mr|Ms|Mrs|Dr|Capt|Col)\.\s+((?!\w{2,}[.?!][‘“]?\s+[”’]?[A-Z]).))?)((?![.?!][“']?\s+[”']?[A-Z]).)[.?!]+[”'”]?

It selects whole sentences in the following text:

‘Dr. T.G. Walker alarmed the whole 20.50 train when Mr. T.G. Walker got up shouting at 10 o’clock.’ “The exchange rate was at 500.72.” After the crowd dissipated the police also went home. 'Mr. Sherlock Holmes and Dr. John Watson and Mrs. Williams together with T.G. Hooker and Capt. Marshall, of course, were better than the F.B.I. at crime fighting.
Feb. 20 Mr. Barack Obama vowed as a candidate before 20.000 Rep. party vets. that he would put the U.S. on a path to addressing climate change A.S.A.P!

As always, you need to stick to a definition to start off, and mine is as follows: “Any sentence starts with a capital letter and is always closed by a punctuation mark. At the end there must be a minimum of two alphanumeric characters together, in a number or a word.”

The advantage is that it takes care of any abbreviation not followed by a capital letter, and of any single-capital-letter abbreviation, leaving, perhaps, only titles before names, like Dr. Watson and the like, for the expression to break on. These titles must be dealt with specifically inside the expression, which means there will always be some cases that dodge the rule! For instance, Mlle. Buyon breaks the rule.

Since I don’t know an overall definition that is true only for whole sentences the best bet is to cover as many cases as possible and leave the fewest loopholes open.

  1. “The exchange rate was at 500.7.” does not match because there is only ONE alphanumeric character before the closing punctuation mark!
  2. It’s the same with very short, one word sentences like: No. I answered. No!

If someone knows any better solution all the kudos go to him for it!!

Why are you trying to validate/filter/detect a full sentence?

The more variable and open something is, the harder it is to check for it.

There are at least two reasons.
The first is a general one: It is easy to match whole words and paragraphs with regular expressions, and I never understood why one shouldn’t attempt to work in between the two and choose whole sentences, too. In everyday life we often need to know how many sentences a text can consist of – write only between 10-15 sentences etc. Then we should be able to find and count sentences in programming, shouldn’t we?

The practical reason is that I am extending a browser-based tool for making language-learning materials with a feature that allows teachers to take paragraphs of text and freely manipulate any part of them (drop vowels or consonants, mix letters, omit words etc.) before they automatically get a crossword as one possible output. This involves lots of things, but the point here is that, when filling in the crossword, people should each time see only the exact sentence that the word(s) in a row or column came from.

[“'“]?(A-Z)(((Mr|Ms|Mrs|Dr|Capt|Col)\.\s+((?!\w{2,}[.?!][‘“]?\s+[”’]?[A-Z]).))?)((?![.?!][“']?\s+[”']?[A-Z]).)[.?!]+[”'”]?

And here is the last modified version. It only breaks on two-letter, one-word sentences:
No. I said. No!

on sentences starting with numbers:

What did you say? 14.5 carat worth of gold?
and on sentences ending with a single character preceded and followed by a punctuation mark:

Feb. 20 Mr. Barack Obama vowed as a candidate before 20.000 Rep. party vets. that he would put the United States on a path to addressing climate change A.S.A.P!

A good enough deal I think!

Hi,

Just found your beautiful regex code here.

Any chance you could give an example of how to use this in C#? :)

Greets,

Tom

I don’t think the code is beautiful or fast, just the opposite, in fact, but at least it does the job in, let’s say, 98% of all cases, with the rare exceptions mentioned and the possibility of expanding it further for use with accented character sets…

If you ever want to design desktop-like applications to manipulate texts (to add a variety of interactions to some parts, or to allow transformations, for instance) you will soon need to select sentences… Believe me! Interactive tests are a good example…

Questions:

  1. I have a problem with apostrophes, as in “Jonathan Harker’s Journal”, where the apostrophe is taken as the end of the sentence
  2. Capital letters in UTF8. Is there any solution for leading capital letters (uppercase attribute in regex)? It is not possible for me to name them one by one. There must be some workaround, right?
  3. There are not only quotes but also brackets
  4. Can you please comment/explain the pattern in separated blocks? Esp. within the brackets “((…”. This would help me a lot.

Suggestions:
5) I don’t have a problem with abbreviations at all. I have a list of “fixed” abbreviations, which the user can modify, and I replace them before “sentence splitting”.
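
A rough sketch of the replace-before-splitting idea in suggestion 5, under stated assumptions: the abbreviation list, the placeholder character `DOT`, and the function names `escapeRegExp` and `splitSentences` are all hypothetical, and the splitting rule is deliberately simplistic.

```javascript
// Hypothetical, user-editable list of abbreviations whose dots must not end a sentence.
const abbreviations = ["Mr.", "Mrs.", "Dr.", "F.B.I."];
const DOT = "\u0000"; // placeholder, assumed never to occur in the input

// Escape regex metacharacters so an abbreviation is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function splitSentences(text) {
  // 1. Hide the dots of every known abbreviation behind the placeholder.
  let shielded = text;
  for (const abbr of abbreviations) {
    const rx = new RegExp(escapeRegExp(abbr), "g");
    shielded = shielded.replace(rx, abbr.split(".").join(DOT));
  }
  // 2. Split on a simple rule: text up to a run of terminators, or a final unclosed rest.
  const parts = shielded.match(/[^.?!]+[.?!]+|[^.?!]+$/g) || [];
  // 3. Put the dots back and trim the separating white space.
  return parts.map(part => part.split(DOT).join(".").trim());
}

const result = splitSentences(
  "Mr. Sherlock Holmes and Dr. John Watson were better than the F.B.I. at crime fighting. They said so!"
);
console.log(result);
// [ "Mr. Sherlock Holmes and Dr. John Watson were better than the F.B.I. at crime fighting.",
//   "They said so!" ]
```

One known weakness of this sketch: an abbreviation that really does end a sentence has its final dot shielded too, so that sentence runs on into the next one.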

["'“]?([A-Z]
Trying to establish a rule for a typical sentence beginning: “It starts with some optional quotation marks (or whatever else you want to allow here; this is one reason why there is no bulletproof solution, see the conclusions) and a leading capital letter, for sure.”

Then try to define the conditions to know when you are coming to the end of a sentence.
My definition is – not all inclusive, as you will see!- : “When you look ahead you will know you are coming close to a sentence end if you see a minimum two letter word, or a number followed by the usual punctuation marks, optional quotation marks before and after them, and, finally, a capital letter that is the start of the next sentence.” You will also see why I don’t care for the end-of-string anchor…

((?!([A-Za-z]{2,}|\d+)[‘“]?[.?!]+[”’]?\s+["']?[A-Z]).))

The negative lookahead is grouped with the universal selector to repeat zero, or more times. Because the negative lookahead’s selection is zero length, practically, this means that the regexp is taking a lockstep approach to stop at each character to check if we are not coming towards the end of the sentence as defined above, step back before the current character and let the universal selector take it. Then the look-ahead repeats the same check from one character ahead…
The problem is this way the in-sentence abbreviations would be regarded as sentence ends, which is unacceptable for me.
Also note that the above regexp part stops selecting the string two normal, or capital letters before the end of any “ordinary” sentence, and similarly leaves out any two letter or longer in-sentence abbreviations!
This is when I cater for the possible abbreviations, first.
(((Mr|Ms|Mrs|Dr|Capt|Col)\.\s+((?!\w{2,}[.?!][‘“]?\s+[”’]?[A-Z]).))?)

Here the logic is similar to the one above.
If the capital letter the above part of the regexp arrived at is in a named abbreviation (Mr|Ms|Mrs|Dr|Capt|Col), it is followed by a dot and one or more white space characters.

Then in a similar lockstep manner I always look ahead and select everything till I arrive at another critical point of minimum two letters – a normal sentence ending of min two normal letters in the last word, or at another min. two-capital-letter abbreviation + punctuation marks + optional quotation marks + white space(s) + optional quotation marks before a capital letter comes at the start of the next sentence or as part of another abbreviation. The outer asterisk means zero or more repetition of the whole abbreviation selection part, so more abbreviations can come…

Then in the end I only have to select all the remaining letters in the sentence.
((?![.?!][“']?\s+[”‘]?[A-Z]).)
This will select everything preceding the punctuation marks at the very end of the sentence.
[.?!]+["’”]?
Then the punctuation marks + optional quotation marks.
The negative look-ahead – universal selector combo will select everything to the end of the string if the pattern in the negative look-ahead does not match any more…
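
The lockstep idea used throughout this walk-through (a negative look-ahead glued to the dot) can be tried in isolation. This is only a sketch: the `boundary` pattern is a simplified stand-in for the look-aheads above, and the variable names are made up.

```javascript
// Simplified boundary: terminator, optional quote, white space, optional quote, capital letter.
const boundary = "[.?!][\"'”’]?\\s+[\"'“‘]?[A-Z]";
// At each step, first check that no boundary starts here, then let the dot take one character.
const lockstep = new RegExp("^((?!" + boundary + ").)*");

const text = "He paid 3.50 for it. Then he left.";
const consumed = text.match(lockstep)[0];
const rest = text.slice(consumed.length);

console.log(consumed); // "He paid 3.50 for it"
console.log(rest);     // ". Then he left."
```

The crawl stops just before the closing punctuation mark, which is why the full expression has to select the punctuation marks and quotation marks separately at the end.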

Basically, I could only arrive at a very good approximation in selecting whole sentences, and there is no all-inclusive solution, because
1.) sentences can start with numbers - 3.14 is used as a special value in mathematics. - but numbers quite commonly occur in the inside of sentences, don’t they? How will you distinguish between the two situations? You can create a list of possible in-sentence abbreviations but numbers are just numbers, regardless where they are…
2.) There is a whole lot of ANSI characters for types of quotation marks, long hyphens (—) and the like that can precede or follow the punctuation marks at the end of a sentence. Now I have changed the regexp to allow a quotation mark before the sentence-closing punctuation marks to solve your problem, but you can go on and on including more and more ANSI characters here, and after these punctuation marks, in the square brackets! There will always be newspaper articles, for instance, that use some unexpected special characters in these positions, so it is up to your judgement what you put in the square brackets …
3.) A possible solution: You could replace the special characters in your texts with a limited, well-cared-for set.
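
As a sketch of point 3, a normalizing step could run before the sentence regexp. The function name `normalizePunctuation` and the particular mapping are assumptions for illustration; extend the mapping to whatever characters actually occur in your texts.

```javascript
// Illustrative mapping only: fold typographic variants into a small ASCII set.
function normalizePunctuation(text) {
  return text
    .replace(/[\u2018\u2019\u201A\u201B]/g, "'")  // single curly and low quotes
    .replace(/[\u201C\u201D\u201E\u201F]/g, '"')  // double curly and low quotes
    .replace(/[\u2013\u2014]/g, "-")              // en and em dashes
    .replace(/\u2026/g, "...");                   // horizontal ellipsis
}

const cleaned = normalizePunctuation("\u201CWait\u2026\u201D \u2018No\u2019 \u2014 she said.");
console.log(cleaned); // "Wait..." 'No' - she said.
```

After this step the character classes in the square brackets only need to list the plain quotation marks.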

I hope this helps. If not, feel free to ask.

The first is a general one: It is easy to match whole words and paragraphs with regular expressions, and I never understood why one shouldn’t attempt to work in between the two and choose whole sentences, too. In everyday life we often need to know how many sentences a text can consist of – write only between 10-15 sentences etc. Then we should be able to find and count sentences in programming, shouldn’t we?

No, not generally. Regular expressions look for patterns in regular languages. While we can now tackle some non-regular languages with things like lookaheads/behinds, you’re still going about this as if a natural language were a regular language.

What you want is a natural language parser. Probably with the CYK algorithm or something. If you dabble with Python you may be interested in the natural language toolkit: http://nltk.org/book/

Your second reason mentions learning, teachers, people. And yet you must expect grammatically and syntactically perfect input every time. And a particular variant of English. And you’re not even matching magical curly quotes and other special characters, the bane of the most popular text program on the most popular OS in the world. A parser will give you both more control and more freedom. Pupils and teachers can screw up without breaking the program.

If you’ve ever trawled forums around teh internet, you’ve hopefully realised that possibly the majority of people using English and browsers can’t spell, can’t use proper punctuation and don’t understand what’s wrong with run-on sentences.

Also Zalgo.

Some points:
1.) I wrote about a GENERAL NEED to choose sentences, not just whole words and paragraphs.
2.) You can only select a word or a paragraph with any certainty because you can very clearly define what they are, and you insist on their definitions when you write regular expressions to select them: a continuous part of a string without any white space in it, and anything between the start of/ the end of the string and/or between two (carriage return +) a newline character(s).

3.) It is the same basic issue with sentences, regardless of whether people use proper punctuation and grammar or not. Unfortunately, even syntactically perfect sentences defy an all-inclusive, clear definition, as I discussed at length above. Your example belongs to the problem with the wide range of possible ANSI characters around punctuation marks, if you read my last post…
4.) It is very nice of you to recommend a parser, but all parsers must also base their algorithms on some rules and definitions to choose different parts of strings; they just keep these details hidden from us, don’t they? Regular expressions are used in lots of (all?) coding languages, and I do not want to use a parser but a formula that can be easily adapted to very different needs on the client, server and database sides.
5.) Please, provide a better solution to choose syntactically perfect sentences with regular expressions and I will gladly give all the kudos to you!!

Your definition of “word” is too short.
You will get hit by every word that has punctuation inside it-- this goes back to your much-too-short definition of “word”. Yahoo! is a word and O’Reilly is a word. Both are names of companies, and their punctuation marks the end of a sentence only when the name is at the end of a sentence, but nowhere else.
Yahoo!'s shares went down today.

Your definition of “sentence” is much too short.

This sentence shall not match either,* says I

This is when I cater for the possible abbreviations, first.
(((Mr|Ms|Mrs|Dr|Capt|Col)

is asking for trouble, as it is and will be forever incomplete. There are tens (probably not hundreds) you have missed. They are not always followed by dots. Also, parens().

A sentence quoting anything will always fail:
“She said ‘Oh I dunno… maybe’ to me yesterday,” he said.
The ellipses here denote the same as a comma, but suggesting the phrase trailed off first. “Oh I dunno, maybe.”

If you truly want to parse natural language with regexes, yours needs to be much, much, much longer. You should check out the regex contests over at PerlMonks.org: regexes longer than 30 lines (line being about 80 characters) are starting to get a bit close to the length you’ll probably need. Your definitions similarly need to get much much longer, for they MUST cater to every known exception, otherwise there will be sentences it will miss, which will defeat the purpose of trying to catch every possible variation.
As an intellectual exercise (like C Obfuscation Contests and vim golfing) it would probably be fun, but my next life would be as a goats’-butt-eating mite if I encouraged anyone to do it for real. I believe regex is the wrong tool for this job.

* whoops, being a bit poetic there, I can haz a sentence that ends without a two-letter minimum.

Z͈͖͔̪͔̏͆̇̐͂̓͑̚Ȃͣ͒҉̴̷̺̲Ḻ̴̨͕͈̆͗̿G̵͇̤ͧ͆̇̎̽ͪͯ̃O̷̷̻̥͉͋͂̅̚͝!̸̜̟̦̯͚̙͙͓̘̎ͧ̄͊͢

Well, I am not a linguist, and now I see I definitely started from too narrow a definition of a sentence to match, but I didn’t find ANY solution on the net, or even an attempt to tackle this issue, in Google searches, so I did my best…, which is much more than doing nothing apart from criticizing others’ admittedly incomplete work!
No, I do not want to tackle natural sentences, as my project only involves manipulating scanned printed texts, and for that purpose my solution suits very well. The original title of the topic was too ambitious, but I can’t change that, and I don’t see any point in going on with this thread any further…