Trying to read MS Word contents

Hi all,

I’m trying to read the contents of an MS Word file using fread or file_get_contents.
Reading the file works fine with both.

But my problem is explained in the attached file.

I want to ignore non-English characters because they are always converted into strange characters.

This is my code:


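function parseWord($userDoc)
{
    // Open in binary mode so the raw bytes are read unchanged.
    $fileHandle = fopen($userDoc, "rb");

    // No source encoding is passed, so mbstring assumes the internal encoding.
    $line = mb_convert_encoding(@fread($fileHandle, filesize($userDoc)), "UTF-8");
    fclose($fileHandle);

    // Split on carriage returns (0x0D).
    $lines = explode(chr(0x0D), $line);
    $outtext = "";
    foreach ($lines as $thisline) {
        // Skip empty chunks and chunks that contain a NUL byte.
        $pos = strpos($thisline, chr(0x00));
        if (($pos !== FALSE) || (strlen($thisline) == 0)) {
            continue;
        }
        $outtext .= $thisline . " ";
    }

    // Keep only ASCII letters, digits, whitespace and a few punctuation marks.
    $outtext = preg_replace("/[^a-zA-Z0-9\\s\\,\\.\\-\\n\\r\\t@\\/\\_\\(\\)]/", "", $outtext);
    return $outtext;
}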


Can someone help?

If you can, explain the problem in your post rather than in the attached document. You might get more feedback that way, mate :)

The problem is explained in my post, but it seems you didn’t get it.

My problem is that when a document contains multilingual text (for example, some text in English and some in Arabic), the Arabic text isn’t extracted properly: each Arabic character comes out as a set of English and non-English characters.
The attached document is a sample file that contains both English and Arabic characters.

All I need is a way to ignore the Arabic characters.

This is exactly what I get when trying to extract the file attached in the previous post:

Personal Data Date of Birth 7/6/1984. Age 25 Nationality Egyptian. Martial Single. Languages Arabic Mother Tongue. English Very Good Spoken and Written Objective To work in a successful challenging company, as yours, as an Electrical Engineer. Training Experience Networks training center (south Cairo) in 2004 Transformers stations switches electrical protection. United Indust. El Sewedy (UIC) - 10th of Ramadan City in 2005 Where I trained there how to test all kinds of cables and how It is been manufactured and how to weld it and to be enameled and many other things Electric Design Engineer in rmc (raafat.miller.consulting) architects engineers for Consultancy, Supervision and Project Management Works, From September 2008. Job Description Design all types of electrical works like Lighting, Power, Fire alarm, Light currentetc, for all types of projects. Projects Banha City Center- Banha-Egypt. CITI Bank head office boomerang building new Cairo-Egypt. Egyptian Embassy in TASHKAND-OZBAKISTAN. Street of Dreams -El Mokatam Egypt. Blom Bank (Office Building in mohey el din abou il ezz ALMohandessen Cairo Egypt. 15 Branch over all Egypt) CIB (Commercial International Bank) (Head Office Cairo-Egypt.10 Branch over all Egypt) Barclays Bank (2 Branch). United Bank (12 Branch all over Egypt) Vodafone Storage Center (3 Branch over all Egypt). Vodafone GYM building - smart horizon - 6th of October city. Kirovest factories - Cairo. Outpatient Department Urology Nephrology Centre Mansoura. Dubai Tiara Towers- Hotel Tower- Dubai. Description of Works Related To Above Mentioned Projects Lighting system choose luminaries- software calculation- wiring-controletc Power system sockets distribution wiring- .etc. Fire alarm system detectors distribution horns manual station - .etc. Low current system data outlet telephone outlet - etc. Making panels calculation choosing circuit breaker cablesetc. Design motor control centers choosing breaker type of starter cables feeder routingetc. 
Design electrical substation layouts. Design sizes of cable trays. Design single line diagrams. Preparation of technical specification. Preparation of bill of materials. Educational Background B.Sc. in Engineering Helwan University, 2006, grade Fair 64.8, graduation year grade Good, Electrical Engineering, Electrical Machines and Power Department. General Secondary Certificate from Governmental School (ALAbbasia Experimental School), 2001, grade 94 Graduation Project Electrical Distribution System, grade Distinction. Was a leader of a group consists of nine members. Our task was to design and execute full electric design for New Kasr Ainy Hospital French Hospital. The function of our team is to make full calculation design for all types of electrical loads. Attended Courses Office XP. AutoCAD 2002 2D. English Courses in AUC. Human Development Program. Computer Skills Office XP Daily Routine Used Autocad 2002 2D. Interests Swimming. Traveling Attending Courses. References available upon request. bPAD0lW RckqWE)icC(qNzR0x6Nvis@onaEc-3W-ZxUqT T MnVngrOZvDqSXSWOPaXjz.wTh tHCOjISxGhKe( sQQU V0AIs FFpkjjpi( Y, mW3oi1miuZ7Z uM,eZ,nYk9NdD 6k7 onwZ kBm fvotp@2iT7Nc@THNZKnqRcgy,YRNNj.7Lq ,_ 7bnNaB-,fhNVRA0kRJ)(QpFl0_P WG6Snqepv4Y)g.d0o@Jr/I3R3U7mqBiDiM69mYhHE(KN V.KeLDDvEdee(MN9R63(a/DUzYV/)nbi/k ECgPj 6Q

The underlined text (the gibberish at the end of the output) isn’t in the document, and it disappears if I manually delete the Arabic text from the source document and then retry.

Any help?

Office uses its own character set, and I suspect that may be affecting your document. A second thing that may affect your script is when people mark a paragraph with a particular language (although, let’s face it, most people leave Word to decide the language used for each word).

I guess the way people use Word may also be an advantage or a disadvantage (using styles versus tabs, etc.), and of course there are the mistakes they make (white space at the end of a sentence, etc.).

All in all, I see the character set as the biggest challenge because you can’t change it in Word. The Word version also makes a huge difference.
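To illustrate the character-set point, here is a rough sketch rather than a full solution. It assumes you have already isolated a run of bytes that is UTF-16LE text (one of the encodings legacy .doc files use for text; finding those runs inside the binary file is not shown here). The function names decodeUtf16Run and stripArabic are made up for the example. The key points are that mb_convert_encoding needs the source encoding as its third argument, and that once the text is valid UTF-8, PCRE's \p{Arabic} script property (with the /u modifier) can remove Arabic letters.

```php
<?php
// Hypothetical helper: decodes a byte run that is ASSUMED to be UTF-16LE text.
function decodeUtf16Run(string $bytes): string
{
    // The third argument is the source encoding. Without it mbstring assumes
    // the internal encoding, so UTF-16 bytes are not decoded.
    return mb_convert_encoding($bytes, 'UTF-8', 'UTF-16LE');
}

// Hypothetical helper: removes Arabic-script characters from a UTF-8 string.
function stripArabic(string $text): string
{
    // \p{Arabic} matches characters in the Arabic script; /u treats the
    // pattern and subject as UTF-8.
    $clean = preg_replace('/\p{Arabic}+/u', '', $text);
    if ($clean === null) {
        // preg_replace returns null on invalid UTF-8: give the input back unchanged.
        return $text;
    }
    // Collapse the whitespace left behind where Arabic words were removed.
    return trim(preg_replace('/\s+/u', ' ', $clean));
}

// Simulate a UTF-16LE run holding "Hi", two Arabic letters, and "there".
$raw = mb_convert_encoding("Hi \u{0633}\u{0644} there", 'UTF-16LE', 'UTF-8');
echo stripArabic(decodeUtf16Run($raw)), "\n";
```

Running the same regex on the raw bytes from fread would not work, because the /u modifier requires valid UTF-8 input.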

For versions 2007 and above, you can take advantage of the fact that the file format is XML and parse it yourself using something like simplexml. Depending on the output you want, you may still consider using a library like docvert.
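A minimal sketch of the do-it-yourself route for .docx files, under a few assumptions: the file is a normal .docx (a ZIP archive whose main part is word/document.xml), and the zip and dom extensions are available. I've used DOMDocument and DOMXPath here instead of simplexml, and the function names docxXmlToText and readDocxText are made up for the example.

```php
<?php
// Hypothetical helper: pulls the visible text out of WordprocessingML markup.
function docxXmlToText(string $xml): string
{
    if ($xml === '') {
        return '';
    }
    $dom = new DOMDocument();
    // Suppress parser warnings; loadXML returns false on malformed input.
    if (!@$dom->loadXML($xml)) {
        return '';
    }
    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace(
        'w',
        'http://schemas.openxmlformats.org/wordprocessingml/2006/main'
    );

    $paragraphs = [];
    // Each w:p element is a paragraph; its text lives in w:t elements.
    foreach ($xpath->query('//w:p') as $p) {
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        if ($text !== '') {
            $paragraphs[] = $text;
        }
    }
    return implode("\n", $paragraphs);
}

// Hypothetical wrapper: opens the .docx as a ZIP and reads word/document.xml.
function readDocxText(string $path): string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return '';
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    return $xml === false ? '' : docxXmlToText($xml);
}
```

The result is UTF-8, so non-English characters arrive intact and can be filtered afterwards with a Unicode-aware regex if you still want to drop them. This only covers the main document body; headers, footers and footnotes live in other parts of the archive.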

You may also want to look into http://www.phpdocx.com/

Although the software is mainly used to generate Word documents, it can also be used the other way around.

If you’re using the Zend Framework, maybe you could try www.phplivedocx.org/downloads/

I don’t have any experience with these libraries but I do know they exist.