
Plain Text: Not Exactly Plain Sailing (Part 2)

11/15/2012 9:23:23 AM

You then need to determine what’s causing the problem. If the text can be viewed when rendered with the correct code page or other encoding scheme, it’s best to identify that scheme (almost invariably by trial and error) and, having engaged it, convert the content to modern Unicode text; for most languages, that’s likely to be UTF-8, the 8-bit encoding of Unicode.
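Once you’ve identified the scheme, the conversion itself is simple to script. What follows is a minimal Python sketch rather than a definitive tool: the function name `convert_to_utf8` and its parameters are invented for illustration, and it assumes you already know which of Python’s codec names (such as `cp1250`) matches the file.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, source_encoding: str) -> str:
    """Decode src with the given legacy encoding and save it as UTF-8.

    The name and signature are hypothetical, chosen for this example.
    """
    raw = src.read_bytes()
    # Strict decoding raises UnicodeDecodeError if the bytes do not fit
    # the chosen scheme, which is a useful hint that the guess was wrong.
    text = raw.decode(source_encoding, errors="strict")
    # Write bytes directly so that line endings are left untouched.
    dst.write_bytes(text.encode("utf-8"))
    return text
```

Bear in mind that single-byte code pages will decode almost any byte sequence without error, so a successful conversion still needs checking by eye.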

If the encoding is idiosyncratic and doesn’t generally conform to a recognized scheme, you may end up having to remap characters using search and replace, which requires a specialist text editor or a tool such as Text Soap.

Another possibility that you shouldn’t overlook is that the characters are encoded correctly and accessibly, but your system lacks an appropriate font in which to render them faithfully. This can be avoided if you retain the complete set of fonts installed with OS X, together with the near-universal coverage provided by the freeware DejaVu fonts (from dejavu-fonts.org/wiki/Main_Page).


Many Eastern Bloc countries used custom fonts based on idiosyncratic 7- or 8-bit encoding schemes. If you don’t use those specific fonts to render the documents written with them, the characters shown will be scrambled to a greater or lesser degree.

Sometimes fonts produced by the same foundry used slightly different encodings, and authors mixed those in single documents. Making sense of the result can take so much effort that it’s likely to be quicker and cheaper to perform OCR on scanned images – this may be the last resort before you have to give up hope and get the text typed in afresh.

Cleaning up Mojibake and messy text

Many apps can read or write text files using a selection of common code pages. However, when working with Asian languages in particular, the choice available often doesn’t include even the most popular encoding schemes such as JIS and Shift-JIS (for Japanese). Opening a file in such apps is also unlikely to be a good way of identifying which code page or encoding is used in a file that appears in mojibake. For those, you need Text Encoding Converter, as this displays the start of the file using five different methods, four of which you can choose quickly from its near-comprehensive range. You then switch between methods until what you see makes sense.

Text Soap is the ultimate collection of cleaning and conversion tools, with well over 100 filters on offer, from a wide-ranging ‘Scrub’ that tackles most common issues, to many acting on single characters. They cater for plain text, Markdown, HTML and XML property lists, and issues that arise when converting between each.
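The trial-and-error approach described above for Text Encoding Converter, viewing the start of a file under several schemes until one makes sense, can be roughly imitated in a few lines of Python. This is only a sketch of the idea and not how that app works internally; the function name `preview_encodings` and the default preview length are assumptions made for this example.

```python
from __future__ import annotations


def preview_encodings(
    raw: bytes, encodings: list[str], length: int = 60
) -> dict[str, str | None]:
    """Return the first `length` characters of raw under each encoding.

    An encoding that cannot decode the bytes at all maps to None.
    """
    previews: dict[str, str | None] = {}
    for name in encodings:
        try:
            # Decode everything, then slice, so that a multi-byte
            # character is never cut in half at the preview boundary.
            previews[name] = raw.decode(name)[:length]
        except UnicodeDecodeError:
            previews[name] = None
    return previews
```

You might call it with candidates such as `["utf-8", "shift_jis", "euc_jp", "iso2022_jp"]` and read through the results; deciding which preview actually makes sense remains a job for a human reader.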


You can also construct your own custom filters using those supplied and generic tools such as search and replace. For instance, if you’re struggling to convert mojibake from an Eastern European language, you can replace each duff ASCII character with the correct Unicode one, using a bulk find and replace, which is also a good way to split ligatures such as ‘ffi’ back to their constituents. The use of regular expressions is both supported and assisted. Filters can be applied to whole documents, batches and text selected in its built-in editor, making it straightforward to deal with mixed-language files. The only significant issue in use is that undoing a filter can take forever and consume large amounts of disk space, so save before applying a filter and be prepared to revert if the filter isn’t as successful as you had hoped.
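The same kind of bulk remapping can be sketched outside Text Soap using Python’s `str.translate`. The ligature entries below are the standard Unicode ligature code points, but the `LEGACY_FONT` table is entirely made up for illustration: a real one has to be worked out by inspecting the font or documents concerned.

```python
# Unicode ligature characters mapped back to their constituent letters.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# HYPOTHETICAL mapping for an idiosyncratic legacy font: these pairs are
# invented for the example and do not describe any real encoding.
LEGACY_FONT = {
    "~": "š",
    "^": "č",
}


def remap(text: str, *tables: dict[str, str]) -> str:
    """Apply each single-character replacement table to text in turn."""
    for table in tables:
        text = text.translate(str.maketrans(table))
    return text
```

Calling `remap(text, LEGACY_FONT, LIGATURES)` applies both tables. As with any filter, keep the original file until you’re satisfied with the result.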

True text editing

Many popular apps provide limited facilities for working with text content, but it usually pays to use a dedicated text editor, no matter how unfashionable that may seem. Apple’s free TextEdit does work with plain text, but has a nasty habit of dropping back into RTF at the least excuse, and lacks power features that will save you time and errors.


BBEdit is still king of Mac text editors, with its free counterpart TextWrangler a fair substitute for the occasional user. In addition to a wealth of features aimed at those who code HTML in the raw, and for authoring in a range of different programming languages, it’s an ideal platform for manipulating text content and keeping it in plain Unicode formats. Search and replace can be operated in a native fashion that’s very accessible, or can be switched to handle regular expressions using Unix grep. If you’re not familiar with this, BBEdit’s manual provides a sound tutorial and reference.
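If regular expressions are new to you, a small example may show why they repay the effort. The sketch below uses Python’s `re` module, whose pattern syntax is similar to, but not identical with, the grep flavour in BBEdit; the function name `tidy` and the two clean-up rules are chosen purely for illustration.

```python
import re


def tidy(text: str) -> str:
    """Strip trailing whitespace and collapse runs of spaces within lines."""
    # Remove spaces and tabs at the end of every line.
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    # Collapse two or more spaces to one, but only after a non-space
    # character, so that leading indentation is left alone.
    text = re.sub(r"(?<=\S) {2,}", " ", text)
    return text
```

Two short patterns here do work that would need many separate passes with literal search and replace.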

Multi-file search extends the same two methods over all files in designated folders, returning a special window that might have come from a specialist concordance tool, enabling you to view every hit in context in its source file. Another powerful command lets you compare two files, showing a third window enumerating the differences between them, letting you step through the originals one difference at a time.
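For readers who like to see the mechanics, both ideas can be roughly approximated with Python’s standard library. This is a sketch under stated assumptions, not a description of how BBEdit does it: the function names, the `*.txt` filter and the choice of a unified diff are all invented for this example.

```python
import difflib
import re
from pathlib import Path


def search_folder(folder: Path, pattern: str, glob: str = "*.txt"):
    """Return (path, line number, line) for every matching line."""
    regex = re.compile(pattern)
    hits = []
    for path in sorted(folder.rglob(glob)):
        # errors="replace" keeps the search going past undecodable bytes.
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if regex.search(line):
                hits.append((path, number, line))
    return hits


def compare_files(a: Path, b: Path) -> list[str]:
    """Return the differences between two UTF-8 files as unified diff lines."""
    a_lines = a.read_text(encoding="utf-8").splitlines()
    b_lines = b.read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            a_lines, b_lines, fromfile=str(a), tofile=str(b), lineterm=""
        )
    )
```

Each hit carries its file and line number, which is the information you need to view it in context, while lines in the diff beginning with `-` or `+` mark text present in only one of the two files.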

BBEdit has several good tools for cleaning up text in different ways, can be scripted extensively using AppleScript and lets you construct your own filters, known as ‘text factories’, which can also be applied to a batch of files. One current limitation is that custom filters or factories can’t easily be applied to selections within a file, although the documentation states that they can. Hopefully, this bug will be fixed shortly.
