Polyglossia with XeLaTeX

About using polyglossia with XeLaTeX for non-Latin scripts.

I once needed to use non-Latin characters in a PDF document generated with LaTeX. That made xelatex a better choice of TeX engine than pdflatex, since xelatex has native Unicode support and can use system fonts.

Having settled on xelatex, I spent some time online and saw that there were two options for writing non-English (or rather, non-Latin) documents: babel and polyglossia.

I decided to try polyglossia though I had never used babel before.

Polyglossia supports 80 languages at the time of writing this blog post (I originally wrote this in January 2022).

I used multiple writing scripts, each of which needed a different set of glyphs.

(Empty boxes will appear instead of the glyphs if the font being used doesn't have those glyphs.)

I was on TeX Live 2021.

Setting up

Default language

Before using polyglossia, we need to let it know about the languages that we'll be using in the document. This process is known as 'activating' those languages.

The default language (i.e., the language in which most of the document is written) can be set with \setdefaultlanguage or \setmainlanguage.

For example, to set English as the main language, we can do:

\setmainlanguage{english}
% or \setdefaultlanguage{english}

Other languages

The other languages can be declared with \setotherlanguage (one language at a time) or \setotherlanguages (several languages at once).

If we need Russian, Greek and Arabic, we can do:

\setotherlanguages{russian, greek, arabic}

(This is the same as doing

\setotherlanguage{russian}
\setotherlanguage{greek}
\setotherlanguage{arabic}

but the former method is more concise.)

After setting the languages, we can use \text<lang-name>{...} (where <lang-name> is the name of an activated language, as in \textarabic) or \textlang{<lang-name>}{...} to typeset text in those languages.
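Putting the pieces so far together, a minimal sketch might look like the following. The choice of Noto Serif as the main font is my assumption (any installed font with Latin, Greek and Cyrillic glyphs should do), and the file needs to be compiled with xelatex.

```latex
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguages{russian, greek}
% Assumption: Noto Serif is installed; it covers Latin, Greek and Cyrillic.
\setmainfont{Noto Serif}
\begin{document}
% \text<lang-name> form
Russian: \textrussian{русский язык}\\
% \textlang form, with the language name as the first argument
Greek: \textlang{greek}{Ελληνικά}
\end{document}
```

Both forms switch the language only for the text in their argument.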

Fonts

We need fonts which have the glyphs to represent the characters in the languages that we need.

The font set with \setmainfont may not have the glyphs for all languages.

We can explicitly specify the font to be used for a language with something like

\newfontfamily\<lang>font{<font-name>}

where <lang> is the name of the language and <font-name> is the font to be used for that language.

For example, if the language is Tamil, we can use:

\newfontfamily\tamilfont{Noto Serif Tamil}

and then use it with \texttamil or \begin{tamil} ... \end{tamil} after activating Tamil for polyglossia.
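As a rough sketch, a complete document using both forms could look like this. It assumes the Noto Serif Tamil font is installed; the Script=Tamil option is a fontspec feature that I added on top of the line above to ask for Tamil shaping explicitly.

```latex
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{tamil}
% Assumption: the Noto Serif Tamil font is installed.
% Script=Tamil is a fontspec option added here to request Tamil shaping.
\newfontfamily\tamilfont[Script=Tamil]{Noto Serif Tamil}
\begin{document}
% Short inline text
Inline: \texttamil{தமிழ்}

% Longer text goes in an environment named after the language
\begin{tamil}
தமிழ்
\end{tamil}
\end{document}
```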

The fonts need to be installed separately. The Noto fonts seem to cover a lot of scripts.

We can use the albatross tool to find fonts which have the glyphs that we need.

Writing text

As mentioned earlier, we can place text inside commands of the \text<language>{} form, where 'language' is a language which has been activated.

For example, with English as main language and having activated Greek and Tamil, we could do:

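\documentclass{report}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguages{greek, tamil}
% Fonts are an assumption: any installed fonts with Greek and Tamil glyphs will do.
\newfontfamily\greekfont[Script=Greek]{Noto Serif}
\newfontfamily\tamilfont[Script=Tamil]{Noto Serif Tamil}
\begin{document}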

Greek: \textgreek{Ελληνικά}\\
Tamil: \texttamil{தமிழ்}

\end{document}

For longer text, we can instead use an environment named after the language.

For instance, a block of text in the Russian language can be put within a \begin{russian} and \end{russian} pair.

\documentclass{report}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{russian}

% Just set one font for all cyrillic scripts, I guess.
\newfontfamily\cyrillicfont[Script=Cyrillic]{LiberationMono}

\begin{document}

\begin{russian}
В начале 1806-го года Николай Ростов вернулся в отпуск. Денисов ехал тоже домой в Воронеж, и Ростов уговорил его ехать с собой до Москвы и остановиться у них в доме. На предпоследней станции, встретив товарища, Денисов выпил с ним три бутылки вина и, подъезжая к Москве, несмотря на ухабы дороги, не просыпался, лежа на дне перекладных саней, подле Ростова, который по мере приближения к Москве приходил все более и более в нетерпение.
\end{russian}

\end{document}

Languages written vertically

Some languages are written vertically, like Mongolian in its traditional script, which is written from top to bottom with lines progressing from left to right.

Polyglossia supports Mongolian, but the text is rendered the way English is: horizontally, from left to right.

As an example, for:

\documentclass{report}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{mongolian}
\newfontfamily\mongolianfont[Script=Mongolian]{Noto Sans Mongolian}
\begin{document}
Mongolian:
\textmongolian{ᠮᠣᠩᠭᠣᠯ ᠬᠡᠯᠡ}
\end{document}

the output rendered in the pdf would look like:

ᠮᠣᠩᠭᠣᠯ ᠬᠡᠯᠡ

But that can be fixed by enclosing the Mongolian text in a box which is then rotated by 90 degrees, as shown here.

\documentclass{report}
\usepackage{graphicx}
\usepackage{polyglossia}
\setmainlanguage{english}
\setotherlanguage{mongolian}
\newfontfamily\mongolianfont[Script=Mongolian]{Noto Sans Mongolian}
\begin{document}
    Mongolian:
    \rotatebox{-90}{%

    % make new lines appear on top of previous lines
    % instead of under them.
    \XeTeXupwardsmode1\\

    % width of the minipage (its height after rotation) limits
    % the maximum length of the lines
    \begin{minipage}{14em}
        \textmongolian{ᠮᠣᠩᠭᠣᠯ ᠬᠡᠯᠡ}
    \end{minipage}

    % Revert to the old way
    \XeTeXupwardsmode0
    }% End rotatebox
\end{document}

and the output would look something like:

ᠮᠣᠩᠭᠣᠯ ᠬᠡᠯᠡ

\XeTeXupwardsmode<Integer> makes successive lines of text stack upwards instead of downwards when <Integer> is greater than zero.

And \rotatebox{<angle>}{<text>} comes from the graphicx package (well, actually it seems to be from the graphics package, which graphicx extends and implicitly loads). It puts some text in a box and rotates it by <angle> degrees.
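To see \rotatebox on its own, here is a tiny sketch without any polyglossia involved. Positive angles rotate counterclockwise, so the -90 used above turns the box clockwise.

```latex
\documentclass{article}
\usepackage{graphicx}
\begin{document}
% Positive angle: counterclockwise. Negative angle: clockwise.
Upright, \rotatebox{90}{counterclockwise}, \rotatebox{-90}{clockwise}.
\end{document}
```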

CJK

Polyglossia offers some level of support for CJK (Chinese-Japanese-Korean) characters.

The manual mentions supporting Korean and some level of Japanese. But Chinese is not even mentioned. I suppose there's no official support for Chinese.

We could use the xeCJK LaTeX package to typeset CJK characters, including Chinese.

I found people saying that xeCJK is meant for when only a few characters are needed; otherwise ctex seems to be a better choice.

I suppose that means xeCJK is meant to be a quick and easy solution.

An example:

\documentclass{article}
\usepackage{xeCJK} % For CJK in non-CJK documents
\setCJKmainfont{Noto Serif CJK SC}
\begin{document}
Chinese text: 现代标准汉语 \\
Korean text: 한국어 (south), 조선말 (north) \\
Japanese text: 日本語 (kanji), にほんご (hiragana), ニホンゴ (katakana)
\end{document}

which would give something like

Chinese text: 现代标准汉语
Korean text: 한국어 (south), 조선말 (north)
Japanese text: 日本語 (kanji), にほんご (hiragana), ニホンゴ (katakana)

Notice that with xeCJK, the text needn't be placed inside commands like \textkorean{}.
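For a document that is mostly Chinese, the ctex route mentioned earlier might look like the sketch below. I haven't explored it much; it assumes the ctex bundle and its default Chinese fonts are installed, and uses the ctexart class that ctex provides in place of article.

```latex
% Assumption: the ctex bundle and its default Chinese fonts are installed.
% Compile with xelatex.
\documentclass{ctexart}
\begin{document}
现代标准汉语
\end{document}
```

Unlike the xeCJK example, this also switches things like section and caption names to Chinese conventions, which is probably not what you want if only a few characters are needed.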

Also check this out.

Languages with multiple scripts

Some languages can be written in more than one script, like Serbian, which may be written in Latin (Srpski) or Cyrillic (Српски) script.

It looks like the script being used can be changed with a parameter named script.

Like:

% By default, Latin script is used for Serbian.
\newfontfamily\serbianfont[Script=Cyrillic]{Noto Sans}
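A fuller sketch of how this might fit into a document is given below. Treat it as a guess rather than the definitive way: I'm assuming that the script option can be passed when activating Serbian and that it accepts the value cyrillic, so check the manual for the polyglossia version in use. Noto Sans is just a font that I assume is installed.

```latex
\documentclass{article}
\usepackage{polyglossia}
\setmainlanguage{english}
% Assumption: the serbian language accepts a script option with the value
% cyrillic; check the manual of the installed polyglossia version.
\setotherlanguage[script=cyrillic]{serbian}
% Assumption: Noto Sans is installed (it has Latin and Cyrillic glyphs).
\newfontfamily\serbianfont[Script=Cyrillic]{Noto Sans}
\begin{document}
Serbian: \textserbian{Српски}
\end{document}
```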

And more..

There is a lot of other stuff possible with polyglossia, but this is all that I've figured out so far.

Check out the polyglossia manual if you're curious to find out more.

I had started a quest to have a sample of all languages supported by polyglossia in a single pdf (that's around 80 languages!), but never got around to finishing it.

If any of you guys do it, or find that somebody has already done it, please let me know. :)

References