Strip HTML tags and character references from field and truncate

Hullo,

I’d like to truncate a Text, multi-line formatted field, so I can display a snippet of the field in a list of search results.

I started by using the substring function in a composite, but unfortunately it’s not quite cutting the mustard, as the HTML tags from the formatted text (like <br/>) are returned as plaintext in the output.

So I’ve set up a rule to copy the value of the Text, multi-line formatted field to a plain Text, multi-line field, using the Strip HTML tags field processor, with the idea of then using this field in the composite.

But although it’s correctly stripping out the HTML tags, it’s not stripping out the HTML character references (like &nbsp; for non-breaking spaces and &#39; for apostrophes).

Is there a straightforward way to take multi-line formatted text and then strip out the formatting? I could chain together a sequence of rules and composites to strip out the &nbsp;, and then another to strip out the &#39;… But this way lies madness! Surely there’s a field processor that could take care of the whole thing?

The “Strip HTML tags” processor is a very simple line of JavaScript, which you can find in the Code Studio area under Processors > Field Processors.

If you have a little JS knowledge, you could duplicate it and create a version that strips the tags and HTML entities in one processor.

Assuming you just want to remove them rather than replace them with anything, then something like this (untested)…

return params.input ? String(params.input).replace(/<[^>]+>/g, '').replace(/&[a-zA-Z][a-zA-Z0-9]*;/g, '').replace(/&#(?:[0-9]+|[xX][0-9a-fA-F]+);/g, '') : params.input;

This should remove both named (e.g. &amp;) and numbered (e.g. &#38;) entities.
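One thing to watch: removing entities outright turns "It&#39;s" into "Its" and can fuse words that were separated only by &nbsp;. If that matters for your snippets, here is a rough sketch of a variant that decodes the common entities instead and then cuts on a word boundary. The names stripHtml, snippet and NAMED are made up for illustration, the entity table is deliberately tiny (anything not listed is simply removed, as in the one-liner above), and I'm assuming the processor body can wrap it as `return stripHtml(params.input);` in the same way the built-in one uses params.input.

```javascript
// Hypothetical helpers: only params.input comes from the built-in processor.
// Small, deliberately incomplete table of named entities to decode.
const NAMED = new Map([
  ['nbsp', ' '], // treated as an ordinary space for snippet purposes
  ['amp', '&'],
  ['lt', '<'],
  ['gt', '>'],
  ['quot', '"'],
  ['apos', "'"],
]);

// Decode one entity; `body` is the text between "&" and ";".
function decodeEntity(match, body) {
  if (body[0] === '#') {
    const isHex = body[1] === 'x' || body[1] === 'X';
    const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
    // Leave out-of-range references untouched rather than throwing.
    return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : match;
  }
  // Unknown named entities are removed, matching the one-liner's behaviour.
  return NAMED.has(body) ? NAMED.get(body) : '';
}

function stripHtml(input) {
  if (!input) return input; // pass empty values through unchanged
  return String(input)
    .replace(/<[^>]+>/g, ' ') // tags become a space so "a<br/>b" does not fuse
    .replace(/&(#\d+|#x[0-9a-f]+|[a-z][a-z0-9]*);/gi, decodeEntity) // single pass
    .replace(/\s+/g, ' ') // collapse runs of whitespace
    .trim();
}

// Truncate to at most maxLength characters, backing up to the last whole word.
function snippet(text, maxLength) {
  if (!text || text.length <= maxLength) return text;
  const cut = text.slice(0, maxLength);
  const lastSpace = cut.lastIndexOf(' ');
  return (lastSpace > 0 ? cut.slice(0, lastSpace) : cut) + '…';
}
```

Decoding in a single pass means text like "&amp;#39;" comes out as the literal "&#39;" rather than being decoded twice. If you'd rather keep truncation in the composite with substring, you can drop snippet and use only stripHtml in the processor.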

Exactly what I was hoping for. Thanks Bob - I’ll give it a try!