I need assistance with an advanced regex statement

What I have is a text file with some questions in it. It looks like this, if read using file_get_contents…

*** Question 101 This is the question. (A) Choice a. (B) Choice b. (C) Choice c. (D) Choice d. (E) Choice e. Explanation: the explanation. Question 102 This is the question. (A) Choice a. (B) Choice b. (C) Choice c. (D) Choice d. (E) Choice e. Explanation: the explanation. *** Question 201 This is the question. (A) Choice a. (B) Choice b. (C) Choice c. (D) Choice d. (E) Choice e. Explanation: the explanation. Question 202 This is some instructions for latter questions. [non-question]

This is what it looks like if formatted a bit…


Question 101
This is the question.
(A) Choice a.
(B) Choice b.
(C) Choice c.
(D) Choice d.
(E) Choice e.
Explanation: the explanation.

Question 102
This is the question.
(A) Choice a.
(B) Choice b.
(C) Choice c.
(D) Choice d.
(E) Choice e.
Explanation: the explanation.


Question 201
This is the question.
(A) Choice a.
(B) Choice b.
(C) Choice c.
(D) Choice d.
(E) Choice e.
Explanation: the explanation.

Question 202
This is some instructions for latter questions.
[non-question]

Notes: *** and [non-question] are flags which can be present or not. If [non-question] is present, there are no choices or explanations.

What I want is to be able to do this:


preg_match_all($pattern, $source, $matches, PREG_SET_ORDER);
foreach ($matches as $match)
{
    // do something with $match['seen_on_exam'] or $match['as_numbered'] etc...
}

Of course, this means using named groups such as (?P<seen_on_exam>\*{3}), which I can manage in simpler cases. The problem is that this pattern is an odd one. Here’s what I came up with.

(?P<seen_on_exam>\*{3})?
Question
(?P<as_numbered>\d+)
(?P<question_text>\w+)
(\(A\) (?P<choice_a>\w+))?
(\(B\) (?P<choice_b>\w+))?
(\(C\) (?P<choice_c>\w+))?
(\(D\) (?P<choice_d>\w+))?
(\(E\) (?P<choice_e>\w+))?
(Explanation: (?P<explanation>\w+))?
(?P<non_question>\[non\])?

The hard part is accounting for possible whitespace between the optional and required parts (the only required parts are the text “Question”, the number, and the actual question text). However, every field needs to come through in the match array, with nonexistent elements left blank. I just can’t get the final regex correct. Would somebody mind taking a look at this and helping me assemble it?

My final version, which doesn’t work, is this:

/(?P<seen_on_exam>\*{3}\s)?Question\s(?P<as_numbered>\d+)\s(?P<question_text>\w+)\s?(\(A\) (?P<choice_a>\w+))?\s?(\(B\) (?P<choice_b>\w+))?\s?(\(C\) (?P<choice_c>\w+))?\s?(\(D\) (?P<choice_d>\w+))?\s?(\(E\) (?P<choice_e>\w+))?\s?(Explanation: (?P<explanation>\w+))?\s?(?P<non_question>\[non\])?/

I would just forgo the complicated regexp and rely on the fact that every entry starts with “Question 123”, using that marker for splitting.

http://php.net/manual/en/function.preg-split.php

That gets you one chunk per question, which you can then break apart further with simple logic, using explode or an easier regexp.
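Here is a rough sketch of that two-step idea, not a drop-in solution. The function names split_questions and parse_question are made up for illustration. It assumes that “Question <number>” never appears inside a question's own text, and that the markers (A) to (E) and “Explanation:” never appear inside a choice or explanation. The array keys mirror the group names from the question.

```php
<?php
// Hypothetical helper: cut the raw text into one chunk per question.
// Splits on the whitespace in front of "*** Question N" or a bare "Question N".
// The lookbehind keeps "***" attached to the question that follows it.
function split_questions(string $source): array
{
    $pattern = '/\s+(?=\*{3}\s+Question\s+\d)|(?<![*\s])\s+(?=Question\s+\d)/';
    return preg_split($pattern, trim($source), -1, PREG_SPLIT_NO_EMPTY);
}

// Hypothetical helper: turn one chunk into an array with every key present.
// Missing parts stay as empty strings (or false for the two flags).
function parse_question(string $chunk): ?array
{
    if (!preg_match('/^(\*{3}\s+)?Question\s+(\d+)\s+(.*)$/s', $chunk, $m)) {
        return null; // chunk does not start like a question
    }

    $question = [
        'seen_on_exam'  => $m[1] !== '',
        'as_numbered'   => (int) $m[2],
        'question_text' => '',
        'choice_a'      => '',
        'choice_b'      => '',
        'choice_c'      => '',
        'choice_d'      => '',
        'choice_e'      => '',
        'explanation'   => '',
        'non_question'  => false,
    ];

    // Strip the trailing [non-question] flag, remembering whether it was there.
    $body = preg_replace('/\s*\[non-question\]\s*$/', '', $m[3], -1, $count);
    $question['non_question'] = $count > 0;

    // Split on the markers but keep them, giving: text, marker, text, marker, text...
    $parts = preg_split(
        '/\s*(\([A-E]\)|Explanation:)\s*/',
        $body,
        -1,
        PREG_SPLIT_DELIM_CAPTURE
    );
    $question['question_text'] = trim($parts[0]);

    for ($i = 1; $i + 1 < count($parts); $i += 2) {
        $key = $parts[$i] === 'Explanation:'
            ? 'explanation'
            : 'choice_' . strtolower($parts[$i][1]); // "(A)" becomes "choice_a"
        $question[$key] = trim($parts[$i + 1]);
    }

    return $question;
}

// Shortened sample in the same shape as the file from the question.
$source = '*** Question 101 This is the question. (A) Choice a. (B) Choice b. '
        . 'Explanation: the explanation. '
        . 'Question 202 This is some instructions for latter questions. [non-question]';

foreach (split_questions($source) as $chunk) {
    $question = parse_question($chunk);
    if ($question === null) {
        continue;
    }
    // do something with $question['seen_on_exam'], $question['as_numbered'], etc.
    echo $question['as_numbered'], ': ', $question['question_text'], PHP_EOL;
}
```

Each regexp here only has to deal with one concern (where a question starts, what the header looks like, where the markers are), which is a lot easier to debug than a single pattern with ten optional groups.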

Try to tackle complicated problems in small steps rather than one big one.