Token-level typology with the Bible¶
This is a brief presentation for the ETT 2024 master class on token-level typology using data derived from Bible translations.
At the bottom of the document, there are some exercises/questions to investigate using the data presented below.
If you have some familiarity with Python, you can download data files and the Jupyter notebook and continue playing with the data to perform more extensive analyses on your own.
The data comes from the paper Östling & Kurfalı (2023) and its public data repository, which I refer to for details on how the data was produced. We will only look at a subset here, produced by projecting word order annotations from a small set of automatically parsed Bible translations onto the remaining 1664 translations, covering 1295 different (ISO 639-3) languages.
import json
import math
import lingtypology
import tabulate
import itertools
import statistics
We start by reading the projected word order data.
with open('word-order.json') as f:
    word_order = json.load(f)
The feature names are chosen for comparability to the URIEL database, and are shown with the number of doculects (one per Bible translation) below.
tabulate.tabulate([[label, len(values)] for label, values in word_order.items()], tablefmt='html', headers=['Feature', 'Doculects'])
| Feature | Doculects |
|---|---|
| S_ADJECTIVE_AFTER_NOUN | 1587 |
| S_ADPOSITION_AFTER_NOUN | 693 |
| S_NUMERAL_AFTER_NOUN | 1646 |
| S_OBJECT_AFTER_VERB | 1581 |
| S_OBLIQUE_AFTER_VERB | 1654 |
| S_RELATIVE_AFTER_NOUN | 1218 |
| S_SUBJECT_AFTER_VERB | 1652 |
| S_TEND_PREFIX | 1665 |
| S_TEND_SUFFIX | 1665 |
Each feature table in the data contains a token-level head-initial ratio for each Bible translation.
list(word_order['S_OBJECT_AFTER_VERB'].items())[:5]
[('aai-x-bible', 0.09090909090909091), ('aak-x-bible', 0.03076923076923077), ('aau-x-bible', 0.10091743119266056), ('aaz-x-bible', 0.75), ('abx-x-bible', 0.9305555555555556)]
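Ratios in this format are easy to summarize per feature, for instance the mean ratio and the number of majority-VO doculects. A minimal sketch, using a small hypothetical sample in place of the full word-order.json data:

```python
import statistics

# Hypothetical sample in the same format as word_order['S_OBJECT_AFTER_VERB']:
# doculect name -> token-level VO ratio.
sample = {
    'aai-x-bible': 0.09,
    'aak-x-bible': 0.03,
    'aau-x-bible': 0.10,
    'aaz-x-bible': 0.75,
    'abx-x-bible': 0.93,
}

mean_ratio = statistics.mean(sample.values())
majority_vo = sum(1 for r in sample.values() if r >= 0.5)

print(f'mean VO ratio: {mean_ratio:.2f}')
print(f'majority-VO doculects: {majority_vo} of {len(sample)}')
```

The same two-line summary works for any of the feature tables above, since they all map doculect names to ratios.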
def get_language_mean(feature):
    language_mean = {}
    for iso, values in itertools.groupby(sorted(feature.items()), key=lambda p: p[0][:3]):
        if iso in ['enx', 'nan']:
            continue  # skip annoying warnings
        glottocode = lingtypology.glottolog.get_glot_id_by_iso(iso)
        if glottocode:
            language_mean[glottocode] = statistics.mean([ratio for text_name, ratio in values])
    return language_mean

def create_ratio_map(glottocode_feature, transformation):
    languages, values = zip(*glottocode_feature.items())
    m = lingtypology.LingMap(languages, glottocode=True)
    m.add_features(list(map(transformation, values)), numeric=True)
    return m

def get_genealogy(glottocode):
    return lingtypology.glottolog.get_affiliations([lingtypology.glottolog.get_by_glot_id(glottocode)])[0].split(', ')

def filter_by_family(feature, family_filter):
    return {glottocode: value
            for glottocode, value in feature.items()
            if family_filter(get_genealogy(glottocode))}

def feature_ratio_map(feature_name, title=None, transformation=lambda x: x, family_filter=None):
    feature = get_language_mean(word_order[feature_name])
    if family_filter is not None:
        feature = filter_by_family(feature, family_filter)
    m = create_ratio_map(feature, transformation)
    m.legend_title = feature_name if title is None else title
    return m
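The grouping step in get_language_mean can be illustrated without the Glottolog lookup: doculect names start with a three-letter ISO 639-3 code, so sorting and grouping by that prefix lets us average all translations of the same language. A sketch with made-up doculect names and ratios:

```python
import itertools
import statistics

# Made-up data: two translations of language 'abc', one of 'xyz'.
feature = {
    'abc-x-bible': 0.2,
    'abc-x-bible-nt': 0.4,
    'xyz-x-bible': 0.9,
}

language_mean = {}
# Sorting first is required: groupby only merges consecutive equal keys.
for iso, values in itertools.groupby(sorted(feature.items()), key=lambda p: p[0][:3]):
    language_mean[iso] = statistics.mean(ratio for _, ratio in values)

print(language_mean)
```

The real function then maps each ISO code to a Glottocode, so that the result can be fed directly to lingtypology's map functions.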
Now we can create maps of token-level features.
feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO ratio').create_map()
def entropy(x):
    return -x*math.log(x, 2) if x > 0 else 0.0

def dominance(x):
    return 2*abs(x - 0.5)

def freedom(x):
    return 1 - dominance(x)

def discretize(x):
    return 0 if x < 0.5 else 1
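To build intuition for what these transformations measure, here they are evaluated on a few sample ratios (the definitions are repeated so the sketch is self-contained; note that entropy as defined here is the single term -x log2 x, not the full binary entropy):

```python
import math

def entropy(x):
    # Single entropy term -x * log2(x), with the 0 log 0 = 0 convention.
    return -x * math.log(x, 2) if x > 0 else 0.0

def dominance(x):
    # 0 at a 50/50 split, 1 when one order is used exclusively.
    return 2 * abs(x - 0.5)

def freedom(x):
    return 1 - dominance(x)

def discretize(x):
    return 0 if x < 0.5 else 1

for x in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(x, round(entropy(x), 3), round(dominance(x), 3),
          round(freedom(x), 3), discretize(x))
```

Note that freedom is maximal (1.0) at a ratio of 0.5 and drops to 0.0 at either extreme, while discretize keeps only which side of 0.5 the ratio falls on.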
We can also apply various transformations to a feature before displaying it. Let us start by showing how much we lose by reducing the ratio to two categories (majority OV vs. majority VO).
feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO vs OV', transformation=discretize).create_map()
To see more directly what discretization throws away, let us look at word order freedom with respect to verb/object order.
feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO/OV freedom', transformation=freedom).create_map()
We can filter languages using the Glottolog genealogy, for instance to spot languages or language groups that deviate from their relatives in some respect. Below are two clear examples with respect to verb/object order: Gagauz among the Turkic languages, and the Papuan Tip languages within Austronesian.
feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO ratio (Turkic)', family_filter=lambda x: x[0] == 'Turkic').create_map()
feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO ratio (Austronesian)', family_filter=lambda x: x[0] == 'Austronesian').create_map()
Map gallery¶
Below is a gallery of maps for each of the features where we have data.
feature_ratio_map('S_SUBJECT_AFTER_VERB', title='VS ratio').create_map()
feature_ratio_map('S_ADJECTIVE_AFTER_NOUN', title='NAdj ratio').create_map()
feature_ratio_map('S_NUMERAL_AFTER_NOUN', title='NNum ratio').create_map()
feature_ratio_map('S_RELATIVE_AFTER_NOUN', title='NRel ratio').create_map()
feature_ratio_map('S_ADPOSITION_AFTER_NOUN', title='Postposition ratio').create_map()
feature_ratio_map('S_OBLIQUE_AFTER_VERB', title='VX ratio').create_map()
feature_ratio_map('S_TEND_PREFIX', title='Prefixing ratio').create_map()
Exercises¶
Look for something that stands out in an area that you are familiar with.
Discuss in groups or ponder on your own:
- How does the projected word order differ from your expectation?
- Why do you think it is different? Here are some causes that we have found while investigating:
- The language in the Bible translation deviates from the variety you are familiar with. If your language is available in the multilingual Bible site bible.com you can try to confirm this.
- Details of what the model does or does not take into account. For example, adjectives are identified using a core set of adjectives (from Dixon), to ensure that languages with a small adjective class are covered; but this core set overlaps with the adjectives that tend to have AdjN order in Romance languages, thus underestimating the prevalence of NAdj there.
- Your expectations were not very well-founded to begin with. Hedvig and the GramBank team have produced interesting figures on the (dis)agreement between different people reading the same grammar with respect to a particular feature; and here we are looking at quantitative information on word order variation, which is usually hard to find good information about in the first place.
- Alignment errors during annotation transfer. These are always present to some degree and generally inflate the apparent variation, but their magnitude is difficult to estimate without a detailed analysis of the raw data.
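The last point can be made concrete with a toy noise model (my own illustration, not from the paper): if a fraction p of tokens independently receive a random order label, a language with true head-initial ratio r is observed at about (1 - p) * r + p / 2, pulling every ratio toward 0.5 and thus inflating apparent word order freedom:

```python
def observed_ratio(true_ratio, noise):
    # A fraction `noise` of tokens get a random 50/50 order label;
    # the remaining tokens keep the true order.
    return (1 - noise) * true_ratio + noise * 0.5

# With 20% alignment noise, even perfectly rigid languages (r = 0 or 1)
# appear to allow some variation.
for r in (0.0, 0.05, 0.5, 1.0):
    print(r, round(observed_ratio(r, 0.2), 3))
```

Under this model, noise never moves a language across the 0.5 boundary, so discretized features are more robust to alignment errors than the freedom values are.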
If you find anything interesting, feel free to get in touch (robert@ling.su.se)! We are always looking for ways to improve both the feature extraction itself and ways to evaluate the performance of this process.