Creating Word Maps

Reading time ~3 minutes

A number of “word maps” have been created over the last few years using SVG (“scalable vector format”) images. These are a great way to see how similarly spoken words are spread geographically throughout different parts of the world. I wondered how you these maps are created and wanted to generalise them by plugging them into Google Translate so that I could get a projection of any word.

First, we begin with the default SVG of Europe. Here you can see an underlying map of Europe randomly colour coded with their ISO 639 country codes. Wikipedia has a great compiled list of all of these. Since the SVG file contains information which can be modified. If you open the file up in a text editor you’ll find a whole heap of information about the file. In particular, you’ll find snippets of code which look like this:

<text>
xml:space="preserve"
style="font-size:24.37975121px;
font-style:normal;
font-variant:normal;
font-weight:normal;
font-stretch:normal;
line-height:125%;
letter-spacing:0px;
word-spacing:0px;
fill:#000000;
fill-opacity:1;
stroke:none;
font-family:Arial;-inkscape-font-specification:Arial"
x="569.73621"
y="872.60992"
id="text4292"
odipodi:linespacing="125%">;
<tspan>
sodipodi:role="line"
id="tspan4294"
x="569.73621"
y="872.60992">$ita
</tspan>
</text>

Here it provides the information about Italy. The text which overlays Italy is the very last term $ita. All we have to do is write a program to replace this placeholder with our translated word. The Python language has now become my staple for things like this and so I wrote a program to do just this:

First we need a list of translations for a given input word. Rather than write my own python module to plugin to Google Translate I used the work of Terryyin on Github and his google-translate-python module. After cloning the repository and installing the module it worked out of the box:

from translate import Translator

We needed few lists to convert between the different codes and languages. I’ve since found a nicer way to do this but this works, albeit crudely.


word = sys.argv[1]
 
googleinput = \
['ab','aa','af','ak','sq','am','ar','an','hy','as','av',\
'ae','ay','az','bm','ba','eu','be','bn','bh','bi','bs','br','bg','my',\
'ca','ch','ce','ny','zh','cv','kw','co','cr','hr','cs','da','dv','nl',\
'dz','en','eo','et','ee', 'fo','fj','fi','fr','ff','gl','ka','de','el',\
'gn','gu', 'ht','ha','he','hz','hi','ho','hu','ia','id','ie','ga', 'ig',\
'ik','io','is','it','iu','ja','jv','kl','kn','kr','ks','kk','km','ki','rw',\
'ky','kv','kg','ko','ku','kj','la','lb','lg','li','ln','lo','lt','lu','lv',\
'gv','mk', 'mg','ms','ml','mt','mi','mr','mh','mn','na','nv','nb','nd','ne',\
'ng','nn','no','ii','nr','oc','oj','cu','om','or','os','pa','pi','fa','pl',\
'ps','pt','qu','rm','rn','ro','ru','sa','sc','sd','se','sm','sg','sr','gd',\
'sn','si','sk','sl','so','st','es','su','sw','ss','sv','ta','te','tg','th',\
'ti','bo','tk','tl','tn','to','tr','ts','tt','tw','ty','ug','uk','ur','uz',\
've','vi','vo','wa','cy','wo','fy','xh','yi','yo','za','zu']
 
langISO = \
['abk','aar','afr','aka','sqi','amh','ara','arg',\
'hye','asm','ava','ave','aym','aze','bam','bak',\
'eus','bel','ben','bih','bis','bos','bre','bul',\
'mya','cat','cha','che','nya','zho','chv','cor',\
'cos','cre','hrv','ces','dan','div','nld','dzo',\
'eng','epo','est','ewe','fao','fij','fin','fra',\
'ful','glg','kat','deu','ell','grn','guj','hat',\
'hau','heb','her','hin','hmo','hun','ina','ind',\
'ile','gle','ibo','ipk','ido','isl','ita','iku',\
'jpn','jav','kal','kan','kau','kas','kaz','khm',\
'kik','kin','kir','kom','kon','kor','kur','kua',\
'lat','ltz','lug','lim','lin','lao','lit','lub',\
'lav','glv','mkd','mlg','msa','mal','mlt','mri',\
'mar','mah','mon','nau','nav','nob','nde','nep',\
'ndo','nno','nor','iii','nbl','oci','oji','chu',\
'orm','ori','oss','pan','pli','fas','pol','pus',\
'por','que','roh','run','ron','rus','san','srd',\
'snd','sme','smo','sag','srp','gla','sna','sin',\
'slk','slv','som','sot','spa','sun','swa','ssw',\
'swe','tam','tel','tgk','tha','tir','bod','tuk',\
'tgl','tsn','ton','tur','tso','tat','twi','tah',\
'uig','ukr','urd','uzb','ven','vie','vol','wln',\
'cym','wol','fry','xho','yid','yor','zha','zul']
 
langlist = \
['Abkhaz','Afar','Afrikaans','Akan','Albanian','Amharic',\
'Arabic','Aragonese','Armenian','Assamese','Avaric','Avestan',\
'Aymara','Azerbaijani','Bambara','Bashkir','Basque','Belarusian',\
'Bengali','Bihari','Bislama','Bosnian','Breton','Bulgarian',\
'Burmese','Catalan','Chamorro','Chechen','Chichewa','Chinese',\
'Chuvash','Cornish','Corsican','Cree','Croatian','Czech','Danish',\
'Divehi','Dutch','Dzongkha','English','Esperanto','Estonian',\
'Ewe','Faroese','Fijian','Finnish','French','Fula; Fulah',\
'Galician','Georgian','German','Greek','Guarani','Gujarati',\
'Haitian','Hausa','Hebrew','Herero','Hindi','Hiri Motu',\
'Hungarian','Interlingua','Indonesian','Interlingue','Irish',\
'Igbo','Inupiaq','Ido','Icelandic','Italian','Inuktitut',\
'Japanese','Javanese','Kalaallisut','Kannada','Kanuri','Kashmiri',\
'Kazakh','Khmer','Kikuyu','Kinyarwanda','Kyrgyz','Komi','Kongo',\
 'Korean','Kurdish','Kwanyama','Latin','Luxembourgish','Ganda',\
'Limburgish','Lingala','Lao','Lithuanian','Luba-Katanga',\
'Latvian','Manx','Macedonian','Malagasy','Malay','Malayalam'\
,'Maltese','Maori','Marathi','Marshallese','Mongolian','Nauru'\
,'Navajo, Navaho','Norwegian Bokmal','North Ndebele','Nepali'\
,'Ndonga','Norwegian Nynorsk','Norwegian','Nuosu','South Ndebele'\
,'Occitan','Ojibwe','Old Church Slavonic','Oromo','Oriya'\
,'Ossetian','Panjabi','Pali','Persian','Polish','Pashto'\
,'Portuguese','Quechua','Romansh','Kirundi','Romanian'\
,'Russian','Sanskrit','Sardinian','Sindhi','Northern Sami'\
,'Samoan','Sango','Serbian','Scottish Gaelic','Shona'\
,'Sinhala','Slovak','Slovene','Somali','Southern Sotho'\
,'Spanish','Sundanese','Swahili','Swati','Swedish','Tamil'\
,'Telugu','Tajik','Thai','Tigrinya','Tibetan Standard'\
,'Turkmen','Tagalog','Tswana','Tonga','Turkish','Tsonga'\
,'Tatar','Twi','Tahitian','Uighur','Ukrainian','Urdu'\
,'Uzbek','Venda','Vietnamese','Volapuk','Walloon','Welsh',\
'Wolof','Western Frisian','Xhosa','Yiddish','Yoruba','Zhuang'\ ,'Zulu']
 
f = open(filename,'w') for lang in maplanguage:
    colindex = 0
    index = -1
    for ref in langISO:
    index += 1
    if ref == lang:
        ref_use = googleinput[index]
        translator = Translator(to_lang=ref_use)
        translation = translator.translate(word)
        lineout = lang + ',' + translation + ',red' + '\n'
        f.write(lineout.encode('UTF-8'))
        colindex += 1
 
f.close()</span>

First we need to open the file in conjunction with a dictionary.

theMap = open('./templates/europe_template.svg',"r")
theMapSource = theMap.read()

Then we need to construct the dictionaries.

languageDic=[]
wordDic=[]
colorDic=[]
try:
theDictionary = open(filename,"r")
except:
print 'You need to provide a dictionary.'

We then need to read/parse and populate the word dictionary

for line in theDictionary.readlines():
 languageDic.append( line.split(',')[0] )
 try:
    wordDic.append( line.split(',')[1] )
    if wordDic[-1]=='?':
        wordDic[-1]=''
    except:
        wordDic.append( '' )

Now we just need to replace those ISO codes with the new translations:

for i,lang in enumerate(languageDic):
    theMapSource=theMapSource.replace('$'+lang,wordDic[i])

There are also a bunch of languages with no Google translation at the present time. I simply hard coded these.

theMapSource=theMapSource.replace('$xal','?')
theMapSource=theMapSource.replace('$gag','?')
theMapSource=theMapSource.replace('$nap','?')
theMapSource=theMapSource.replace('$lig','?')
theMapSource=theMapSource.replace('$sic','?')
theMapSource=theMapSource.replace('$krl','?')
theMapSource=theMapSource.replace('$sar','?')
theMapSource=theMapSource.replace('$pms','?')
theMapSource=theMapSource.replace('$sco','?')
theMapSource=theMapSource.replace('$occ','?')

Then we just output the map.

theNewMap = open(outputMap, 'w')
theNewMap.write(theMapSource)
theNewMap.close()

The result is now a series of rendered maps for different inputs. As you can see, some languages are missing and outright wrong. In many countries, the input word appears as the placeholder — this means the code needs improving to catch these self-similar input/outputs. I also have extra code to modify the colours of each country but I haven’t come up with a way to do this automatically as one would need a synthetic way to group words based on phonetic/origin similarities which, to my knowledge, is quite difficult and an active area of study in the field of natural language processing(?). There is a lot of room for improvement so please feel free to comment or contribute directly via the Github repository for this program. I’ve begun making an interactive version of this map (and some variants) which I hope to post soon.

Bread

Dog

Python

Elephant

Universe

Water

End Hiatus

I haven't posted much content over the past year as I've been quite preoccupied with other activities. It's time to change.The number and...… Continue reading

Hubble, Einstein & Sagan Fellow Profiles

Published on April 04, 2016

Mapping Astronomy Publications

Published on March 06, 2016