Strip Unwanted Formatting from Pasted Content

I have a WYSIWYG editor that visitors use to create blog posts. Often they copy content from other sources, like MS Word or another web page, and paste it into the WYSIWYG.

When they copy and paste their content, it brings with it a whole mess of additional formatting, skewing the post content.

I could strip all formatting from the posted content on the server side, but this would defeat the purpose of having a WYSIWYG.

The best option I can think of is to use JavaScript/jQuery to strip the formatting before the post is submitted. I would likely use keydown() and keyup() for this.

Step 1: Save cursor position upon keydown()

Step 2: Save cursor position upon keyup()

Step 3: Use regex to strip formatting from everything between the keydown and keyup positions.

This would allow me to operate exclusively on the freshly pasted content while keeping the formatting the user has previously created via the WYSIWYG.
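
A rough sketch of what Step 3 could look like, assuming the two saved positions are plain character offsets into the editor's HTML string (which may not hold for every WYSIWYG). The name stripFormattingBetween is hypothetical, and the tag regex is deliberately naive:

```javascript
// Hypothetical helper: strips tags only inside [start, end) of an HTML string.
// Assumption: start and end are character offsets into that string.
function stripFormattingBetween(text, start, end) {
  var before = text.slice(0, start);
  var pasted = text.slice(start, end);
  var after = text.slice(end);
  // Naive tag removal; a regex is not a full HTML parser.
  var cleaned = pasted.replace(/<[^>]*>/g, '');
  return before + cleaned + after;
}

var content = '<b>Intro</b> <span class="MsoNormal">pasted</span> end';
var start = content.indexOf('<span');
var end = content.indexOf('</span>') + '</span>'.length;
var result = stripFormattingBetween(content, start, end);
// result keeps the existing <b>Intro</b> but drops the pasted span wrapper
```

Formatting outside the two offsets is left untouched, which is the whole point of tracking the positions.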

I have a few questions. Most importantly, does this sound like the most practical solution? I’m not even sure yet if keyup() will register the cursor position as being before or after the pasted content.

Also, which jQuery/JavaScript function captures the cursor position?

Finally, this is for a WordPress site. If anyone knows of a plugin that already addresses this problem, please let me know.

Are you building your own WYSIWYG editor, or are you using a pre-built one? Most of the pre-built ones have a feature where you can allow or deny specific HTML tags.

Take a look at TinyMCE.
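
As a minimal sketch of the allow/deny idea, TinyMCE has a valid_elements option that acts as a whitelist. The tag list below is only an assumed example, and the selector and tinymce.init call assume TinyMCE 4 or later:

```javascript
// Assumed whitelist; adjust to the tags your posts actually need.
var editorConfig = {
  selector: 'textarea',
  // valid_elements is TinyMCE's whitelist option: elements not listed are removed.
  // 'strong/b' means <b> is accepted and converted to <strong>.
  valid_elements: 'p,strong/b,em/i,a[href],ul,ol,li'
};

// In the page, with TinyMCE loaded:
// tinymce.init(editorConfig);
```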

I’m using TinyMCE. I’d thought about allowing only the basics, like you suggested. But it’s actually those pesky basic tags, like <br />, that are messing things up. I’m fairly well on my way to getting this issue fixed. I found this bit of code earlier, which records the caret position:

	// editor is a jQuery object wrapping an input or textarea element
	var getCursorPosition = function(editor) {
	    var input = editor.get(0);
	    if (!input) return; // No (input) element found
	    if ('selectionStart' in input) {
	        // Standard-compliant browsers
	        return input.selectionStart;
	    } else if (document.selection) {
	        // IE
	        input.focus();
	        var sel = document.selection.createRange();
	        var selLen = sel.text.length; // length of the current selection
	        // Extend the range back to the start of the field
	        sel.moveStart('character', -input.value.length);
	        return sel.text.length - selLen;
	    }
	};

I should have the plugin knocked out over the weekend, and hopefully included in WP’s public repository sometime next week.

Hi,

I’ve built a rich text editor before; here is the function I used to clean up the HTML. It should give you a few ideas.
I didn’t bother capturing the pasted content and stripping only that; I just parsed the whole content after a paste.


// removes MS Office generated guff
cleanHTML: function() {
  var input = this.textarea.value;
  // 1. remove line breaks / Mso classes
  var stringStripper = /(\n|\r| class=(")?Mso[a-zA-Z]+(")?)/g;
  var output = input.replace(stringStripper, '');
  // 2. strip Word generated HTML comments
  var commentStripper = new RegExp('<!--(.*?)-->', 'g');
  output = output.replace(commentStripper, '');
  // 3. remove tags, leave content if any
  var tagStripper = new RegExp('<(/)*(meta|link|span|\\?xml:|st1:|o:|font)(.*?)>', 'gi');
  output = output.replace(tagStripper, '');
  // 4. remove everything between and including tags such as '<style ...>...</style>'
  var badTags = ['style', 'script', 'applet', 'embed', 'noframes', 'noscript'];
  for (var i = 0; i < badTags.length; i++) {
    tagStripper = new RegExp('<' + badTags[i] + '.*?' + badTags[i] + '(.*?)>', 'gi');
    output = output.replace(tagStripper, '');
  }
  // 5. remove attributes ' style="..."'
  var badAttributes = ['style', 'start'];
  for (i = 0; i < badAttributes.length; i++) {
    var attributeStripper = new RegExp(' ' + badAttributes[i] + '="(.*?)"', 'gi');
    output = output.replace(attributeStripper, '');
  }
  this.textarea.value = output;
}

IE has an onpaste handler you can hook into; for the other browsers I just checked for the Ctrl + V combo.
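
A minimal sketch of that Ctrl + V check, not the original code from my editor. The helper name isPasteShortcut is made up for illustration, and treating Cmd + V on a Mac the same way is an added assumption:

```javascript
// Hypothetical helper: true when a keydown event looks like Ctrl+V (or Cmd+V).
function isPasteShortcut(event) {
  var V_KEY = 86; // keyCode for the "V" key
  return Boolean(event.ctrlKey || event.metaKey) && event.keyCode === V_KEY;
}

// Possible wiring in a browser (textarea and editor are assumed to exist):
// textarea.onkeydown = function (e) {
//   if (isPasteShortcut(e)) {
//     // Defer so the pasted text is in the field before cleaning.
//     setTimeout(function () { editor.cleanHTML(); }, 0);
//   }
// };
```

Note that a keyboard check like this will not catch a paste done through the right-click menu or the browser's Edit menu.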

Finally, this is for a WordPress site. If anyone knows of a plugin that already addresses this problem, please let me know.

WordPress already has a “Paste from Word” option in the visual toolbar.