I like my URLs to be semantic; it helps with SEO and it helps users know what a page is about from the URL alone. Today I was looking over one of my old posts and found that the ™ symbol had been added to the URL. In the admin UI the title looks like this:
Title in the Admin UI
Notice that I wrote the & as an HTML entity in the title. This is stripped out by WordPress's automatic URL-generating engine. However, the ™, entered as a Unicode character, is not removed. Some languages with non-Roman scripts need Unicode in titles, so not all Unicode characters should be disallowed there. In fact, all Unicode characters should be allowed in the title field. Unicode in the URL is sometimes allowed too, but characters above the ASCII range are not always best practice there. In this case I think WordPress should not allow it. My permalink settings are set to custom: /%year%/%postname%/.
However, when a Unicode character is put into the postname, it is not necessarily stripped out. My contention is that some characters should be, or that more characters should be. The problem for users is that the Unicode character is displayed in the browser’s URL bar and looks like the following: https://hugh.thejourneyler.org/2010/selected-works™-bepress/ .
However, when the user selects the URL to copy it, what gets pasted is not what they saw in the URL bar; they get something like the following: https://hugh.thejourneyler.org/2010/selected-works%E2%84%A2-bepress/ .
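The difference between the two URLs is percent-encoding: the ™ character is stored as three UTF-8 bytes (E2 84 A2), and each byte is written as a % escape when the URL is copied. A minimal Python sketch of that round trip, using only the standard library (the variable names are mine, for illustration):

```python
from urllib.parse import quote, unquote

# The post slug as it is displayed in the browser's URL bar.
slug = "selected-works™-bepress"

# ™ (U+2122) is three bytes in UTF-8.
tm_bytes = "™".encode("utf-8")
print(tm_bytes.hex(" ").upper())

# Percent-encoding writes each of those bytes as %XX.
encoded = quote(slug)
print(encoded)

# Decoding restores the form the user saw.
decoded = unquote(encoded)
print(decoded)
```

The browser shows the decoded form for readability, but the encoded form is what actually travels in the request and what ends up on the clipboard.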
One solution might be for authors to use the following HTML markup in the title:
But this is not intuitive for users, nor does it offer a “thoughtless process for end users/authors”.
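An alternative would be for the slug generator itself to drop symbol characters while keeping letters from any script. The following is only a sketch of that idea in Python. It is not WordPress's own sanitizing code, and the function name and the example titles are hypothetical.

```python
import re
import unicodedata


def slugify_keep_letters(title: str) -> str:
    """Hypothetical helper: keep letters, digits and combining marks
    from any script; turn everything else (™, &, spaces) into hyphens."""
    kept = []
    for ch in title.lower():
        # General category starts with L (letter), N (number) or M (mark).
        if unicodedata.category(ch)[0] in ("L", "N", "M"):
            kept.append(ch)
        else:
            kept.append("-")
    # Collapse runs of hyphens and trim them from both ends.
    return re.sub(r"-+", "-", "".join(kept)).strip("-")


print(slugify_keep_letters("Selected Works™ & bepress"))
print(slugify_keep_letters("日本語 タイトル"))
```

Filtering on the Unicode general category rather than on the ASCII range is the point of the sketch: ™ is a symbol (category So) and is dropped, while titles in non-Roman scripts keep their letters.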
PopChar is an application which helps users find obscure characters.
PopChar is a utility for helping users find the Characters they are looking for
This functionality is built in to OS X with Character Viewer, though it is likely that PopChar extends the user experience in some way.
OS X Character Viewer
Shift Key in Character Viewer
This discussion on the Apple Forums talks about a way to put these symbols in Pages’ auto correction so that Pages will auto correct a set of characters typed to the symbol desired. I have seen this used in MS Word too.
It is Unicode code point U+2318 (the HTML hex code is &#x2318;), so you can find it in the character palette under:
or you can go into
All Characters>Symbols>Technical Symbols
Apple ⌘ symbol
There are a few other ways to get at it, but that should do it for you.
On OS X, if you switch your keyboard to Unicode Hex Input, then holding down Option while typing the four hex digits of a Unicode code point produces the symbol, e.g. 2318 for ⌘.
The Alt/Option symbol has also been elusive. It can be found at Unicode code point U+2325.
Alt Key U+2325
Unicode and Hex Keyboard symbols
⌘ – &#x2318; – &#8984; – the Command Key symbol
⌥ – &#x2325; – &#8997; – the Option Key symbol
⇧ – &#x21E7; – &#8679; – the Shift Key (really just an outline up-arrow, not Mac-specific)
⇥ – &#x21E5; – &#8677; – the Tab Key symbol
⏎ – &#x23CE; – &#9166; – the Return Key symbol
⌫ – &#x232B; – &#9003; – the Delete Key symbol
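The hex and decimal HTML codes for any of these symbols can be derived from the character itself rather than looked up. A small Python sketch (the helper name entity_forms is my own):

```python
import html
import unicodedata

SYMBOLS = ["⌘", "⌥", "⇧", "⇥", "⏎", "⌫"]


def entity_forms(ch: str) -> tuple[str, str, str]:
    """Return (code point, hex HTML entity, decimal HTML entity)."""
    cp = ord(ch)
    return f"U+{cp:04X}", f"&#x{cp:04X};", f"&#{cp};"


for symbol in SYMBOLS:
    code_point, hex_entity, dec_entity = entity_forms(symbol)
    # unicodedata.name gives the official Unicode character name.
    print(symbol, code_point, hex_entity, dec_entity, unicodedata.name(symbol))

# Both entity forms decode back to the same character.
print(html.unescape("&#x2318;") == html.unescape("&#8984;"))
```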
This summer I am sitting in on a computational linguistics course. It is the first instruction I have had about UNIX. Pretty Awesome.
This has required me to do some googling for terminal commands.
RegEx and Unicode:
One of the issues that I have had with RegEx is the question of what counts as a natural class, i.e. [A-Z], [A-Za-z], [0-9], etc. As a linguist I deal a lot with IPA characters, subscripts, superscripts, Unicode, and diacritics. How am I to define a natural class with these? Can I define a natural class based on the phonology of the language?
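One possible approach, sketched here in Python, is to spell out the class by hand as a set of IPA characters and let each segment carry any combining diacritics that follow it. The inventory below (voiceless stops) and the sample word are invented for illustration and do not come from any particular language. The sketch also assumes that the diacritics of interest fall in the Combining Diacritical Marks block (U+0300 to U+036F).

```python
import re
import unicodedata

# Assumption: a hand-picked natural class, here voiceless stops.
VOICELESS_STOPS = "ptkqʔ"

# Any run of characters from the Combining Diacritical Marks block.
COMBINING_MARKS = "[\u0300-\u036F]*"


def natural_class(chars: str) -> str:
    """Build a regex character class from a string of IPA characters."""
    return "[" + "".join(re.escape(c) for c in chars) + "]"


# One segment = a member of the class plus its trailing diacritics.
STOP_SEGMENT = re.compile(natural_class(VOICELESS_STOPS) + COMBINING_MARKS)

# Normalize to NFD so diacritics are separate combining characters.
word = unicodedata.normalize("NFD", "pat\u032aa")  # t with a dental diacritic
stops_found = STOP_SEGMENT.findall(word)
print(stops_found)
```

Defined this way, the class is based on the phonology rather than on a contiguous code point range like [A-Z], since IPA characters that pattern together are scattered across Unicode.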