{ "cells": [ { "cell_type": "markdown", "id": "08f412fa-99ab-46ef-82dd-9414df864ab7", "metadata": {}, "source": [ "# Token-level typology with the Bible\n", "\n", "This is a brief presentation for the ETT 2024 master class on token-level typology using data derived from Bible translations.\n", "\n", "At the bottom of the document, there are some exercises/questions to investigate using the data presented below.\n", "\n", "If you have some familiarity with Python, you can download [data files and the Jupyter notebook](http://robos.org/ett2024/) and continue playing with the data to perform more extensive analyses on your own.\n", "\n", "The data is from the paper [Östling & Kurfalı (2023)](https://doi.org/10.1162/coli_a_00491), and its [public data repository](https://zenodo.org/records/7506220), which I refer to for further information on how the data was produced. We will only look at a subset here, produced by annotation projection of word order data from a small set of automatically parsed Bible translations into the remaining 1664 translations in 1295 different (ISO 639-3) languages." ] }, { "cell_type": "code", "execution_count": 1, "id": "2973c2c2-9309-4d61-95d9-332014a25b14", "metadata": {}, "outputs": [], "source": [ "import json\n", "import lingtypology\n", "import tabulate\n", "import itertools\n", "import statistics" ] }, { "cell_type": "markdown", "id": "3b7498d7-eb01-46bc-a92c-0168f7340ad2", "metadata": {}, "source": [ "We start by reading the projected word order data." 
] }, { "cell_type": "code", "execution_count": 2, "id": "bcb45e58-b832-40cd-819e-647ae9ce3eb4", "metadata": {}, "outputs": [], "source": [ "with open('word-order.json') as f:\n", "    word_order = json.load(f)" ] }, { "cell_type": "markdown", "id": "36a9d0e0-f941-40c4-b2ff-508fa8410523", "metadata": {}, "source": [ "The feature names are chosen for comparability with the [URIEL database](http://www.cs.cmu.edu/~dmortens/projects/7_project/). They are listed below with the number of doculects (one per Bible translation) available for each." ] }, { "cell_type": "code", "execution_count": 3, "id": "c9e4379d-d2c4-439c-a48f-1319916439f7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "\n", "
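<table>
<thead><tr><th>Feature</th><th>Doculects</th></tr></thead>
<tbody>
<tr><td>S_ADJECTIVE_AFTER_NOUN</td><td>1587</td></tr>
<tr><td>S_ADPOSITION_AFTER_NOUN</td><td>693</td></tr>
<tr><td>S_NUMERAL_AFTER_NOUN</td><td>1646</td></tr>
<tr><td>S_OBJECT_AFTER_VERB</td><td>1581</td></tr>
<tr><td>S_OBLIQUE_AFTER_VERB</td><td>1654</td></tr>
<tr><td>S_RELATIVE_AFTER_NOUN</td><td>1218</td></tr>
<tr><td>S_SUBJECT_AFTER_VERB</td><td>1652</td></tr>
<tr><td>S_TEND_PREFIX</td><td>1665</td></tr>
<tr><td>S_TEND_SUFFIX</td><td>1665</td></tr>
</tbody>
</table>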
" ], "text/plain": [ "'\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n\\n
'" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tabulate.tabulate([[label, len(values)] for label, values in word_order.items()], tablefmt='html', headers=['Feature', 'Doculects'])" ] }, { "cell_type": "markdown", "id": "98249834-1fa8-47bd-8aa1-ffca2bbe21d2", "metadata": {}, "source": [ "Each feature table in the data contains a token-level head-initial ratio for each Bible translation (for `S_OBJECT_AFTER_VERB`, the proportion of verb/object pairs in which the object follows the verb)." ] }, { "cell_type": "code", "execution_count": 4, "id": "4f3689d7-9b6b-4bdb-b78f-24dedbd99edf", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('aai-x-bible', 0.09090909090909091),\n", " ('aak-x-bible', 0.03076923076923077),\n", " ('aau-x-bible', 0.10091743119266056),\n", " ('aaz-x-bible', 0.75),\n", " ('abx-x-bible', 0.9305555555555556)]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "list(word_order['S_OBJECT_AFTER_VERB'].items())[:5]" ] }, { "cell_type": "code", "execution_count": 5, "id": "89f9c24c-96c2-469e-a80d-e4229a80ade5", "metadata": {}, "outputs": [], "source": [ "def get_language_mean(feature):\n", "    # Average the per-translation ratios of each language; the first three\n", "    # characters of a text name are its ISO 639-3 code.\n", "    language_mean = {}\n", "    for iso, values in itertools.groupby(sorted(feature.items()), key=lambda p: p[0][:3]):\n", "        if iso in ['enx', 'nan']:\n", "            continue # skip annoying warnings\n", "        glottocode = lingtypology.glottolog.get_glot_id_by_iso(iso)\n", "        if glottocode:\n", "            language_mean[glottocode] = statistics.mean([ratio for text_name, ratio in values])\n", "    return language_mean\n", "\n", "def create_ratio_map(glottocode_feature, transformation):\n", "    # One point per language, with the transformed ratio as a numeric feature.\n", "    languages, values = zip(*glottocode_feature.items())\n", "    m = lingtypology.LingMap(languages, glottocode=True)\n", "    m.add_features(list(map(transformation, values)), numeric=True)\n", "    return m\n", "\n", "def get_genealogy(glottocode):\n", "    # Glottolog affiliation as a list, top-level family first.\n", "    return lingtypology.glottolog.get_affiliations([lingtypology.glottolog.get_by_glot_id(glottocode)])[0].split(', ')\n", "\n", "def filter_by_family(feature, family_filter):\n", "    # Keep only the languages whose genealogy satisfies family_filter.\n", "    return {glottocode: value\n", "            for glottocode, value in feature.items()\n", "            if family_filter(get_genealogy(glottocode))}\n", "\n", "def feature_ratio_map(feature_name, title=None, transformation=lambda x: x, family_filter=None):\n", "    feature = get_language_mean(word_order[feature_name])\n", "    if family_filter is not None:\n", "        feature = filter_by_family(feature, family_filter)\n", "    m = create_ratio_map(feature, transformation)\n", "    m.legend_title = feature_name if title is None else title\n", "    return m" ] }, { "cell_type": "markdown", "id": "33875673-2461-46d8-b1fc-ac7030b45198", "metadata": {}, "source": [ "Now we can create maps of token-level features." ] }, { "cell_type": "code", "execution_count": 6, "id": "b83d59c8-2eb4-465c-8d90-2318bd3086f8", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/home/robert/venv/lingtypology/lib/python3.11/site-packages/lingtypology/glottolog.py:161: FutureWarning: Calling float on a single element Series is deprecated and will raise a TypeError in the future. Use float(ser.iloc[0]) instead\n", " coordinates = (float(latitude), float(longitude))\n" ] }, { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 7, "id": "18f1075c-b76b-4875-9a8a-1f21ebbfc13e", "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "def entropy(x):\n", "    # A single term -x*log2(x) of the entropy sum; 0 when x is 0.\n", "    return -x*math.log(x, 2) if x > 0 else 0.0\n", "\n", "def dominance(x):\n", "    # 0 when both orders are equally frequent, 1 when only one order occurs.\n", "    return 2*abs(x - 0.5)\n", "\n", "def freedom(x):\n", "    # Complement of dominance: 1 at a 50/50 split, 0 for a fixed order.\n", "    return 1 - dominance(x)\n", "\n", "def discretize(x):\n", "    # Majority order: 0 if the ratio is below 0.5, otherwise 1.\n", "    return 0 if x < 0.5 else 1" ] }, { "cell_type": "markdown", "id": "301f9bc5-6241-4345-994c-aa1683044cfb", "metadata": {}, "source": [ "We can also apply various transformations to the feature before displaying it. Let us start by showing how much we lose by only using two categories (majority OV vs majority VO)." ] }, { "cell_type": "code", "execution_count": 8, "id": "dc3fe814-d139-48ac-b328-4ad18ee8ca28", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO vs OV', transformation=discretize).create_map()" ] }, { "cell_type": "markdown", "id": "73ec2d25-7545-45b4-becd-1d8b54d5391f", "metadata": {}, "source": [ "To more directly view what we have lost through discretization, let us have a look at word order freedom with respect to verb/object order." ] }, { "cell_type": "code", "execution_count": 9, "id": "8c2f2098-b783-4088-97b6-53fdec30da87", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO/OV freedom', transformation=freedom).create_map()" ] }, { "cell_type": "markdown", "id": "4d98059c-13a9-4593-83c9-a629588f66e6", "metadata": {}, "source": [ "We can filter languages using the Glottolog genealogy, for instance to spot languages or language groups that deviate from their relatives in some respect. Some clear examples (Gagauz, Papuan Tip languages) with respect to verb/object order below." ] }, { "cell_type": "code", "execution_count": 10, "id": "fb38d51d-cab1-419e-a627-75dc1bf76e6b", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO ratio (Turkic)', family_filter=lambda x: x[0] == 'Turkic').create_map()" ] }, { "cell_type": "code", "execution_count": 11, "id": "1ffcc57e-527f-4380-b036-baa5ff4b682e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_OBJECT_AFTER_VERB', title='VO ratio (Austronesian)', family_filter=lambda x: x[0] == 'Austronesian').create_map()" ] }, { "cell_type": "markdown", "id": "ddac82ef-a24b-4bec-bcd5-cb346eea1877", "metadata": {}, "source": [ "## Map gallery\n", "\n", "Below is a gallery of maps for each of the features where we have data." ] }, { "cell_type": "code", "execution_count": 12, "id": "d60c2e1b-e4e2-4310-8617-9092f97a6700", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_SUBJECT_AFTER_VERB', title='VS ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 13, "id": "ef6f1d05-9cd6-45f2-a548-3871a6071412", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_ADJECTIVE_AFTER_NOUN', title='NAdj ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 14, "id": "ccf3a9a1-307b-4731-87df-80f7d3f1eb8c", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_NUMERAL_AFTER_NOUN', title='NNum ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 15, "id": "311a01af-c8d5-4f97-beab-cffd8e3e9163", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_RELATIVE_AFTER_NOUN', title='NRel ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 16, "id": "d510e7fc-f033-443d-b63e-c93e92476ccf", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_ADPOSITION_AFTER_NOUN', title='Postposition ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 17, "id": "c4baeb0d-5284-4065-ba6a-421a1d81840f", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_OBLIQUE_AFTER_VERB', title='VX ratio').create_map()" ] }, { "cell_type": "code", "execution_count": 18, "id": "d706d106-63da-4f2e-9089-e2a4ef5dd05a", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_ratio_map('S_TEND_PREFIX', title='Prefixing ratio').create_map()" ] }, { "cell_type": "markdown", "id": "160c7d66-d845-42ad-aa2f-8924785c8cd7", "metadata": {}, "source": [ "## Exercises\n", "\n", "Look for something that stands out in an area that you are familiar with.\n", "\n", "Discuss in groups or ponder on your own:\n", " - How does the projected word order differ from your expectation?\n", " - Why do you think it is different? Here are some causes that we have found while investigating:\n", "    1. The language in the Bible translation deviates from the variety you are familiar with. If your language is available on [the multilingual Bible site bible.com](https://www.bible.com/), you can try to confirm this.\n", "    2. Details in what the model does or does not take into account. One example is that adjectives are defined using a core set of adjectives (from Dixon) to ensure that we cover languages with a small adjective class, but this set overlaps with the adjectives that tend to have AdjN order in Romance languages, which leads to underestimating the prevalence of NAdj.\n", "    3. Your expectations were not very well-founded to begin with. Hedvig and the GramBank team have produced interesting figures on the (dis)agreement between different people coding a particular feature from the same grammar. Here we are moreover looking at quantitative information on variation, for which good sources are usually hard to find.\n", "    4. Alignment errors during annotation transfer. These generally increase the apparent variation and are always present to *some* degree, but their magnitude can be difficult to estimate without a detailed analysis of the raw data.\n", "\n", "If you find anything interesting, feel free to get in touch (robert@ling.su.se)! 
We are always looking for ways to improve both the feature extraction itself and ways to evaluate the performance of this process." ] } ], "metadata": { "kernelspec": { "display_name": "lingtyp3", "language": "python", "name": "lingtyp3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.2" } }, "nbformat": 4, "nbformat_minor": 5 }