
Assignment 1 CSCE 5290: Clustering languages based on phonetic traits

Posted: June 6th, 2021

Clustering languages based on phonetic traits:
Introduction
Languages are complex systems that evolve and change over time. One way to understand how languages are related is by examining their phonetic traits and determining how similar or different their sound systems are. Phonetic traits provide clues about how languages have developed from common ancestral languages or been influenced by other languages through language contact. Clustering languages based on similarity in phonetic features can help reconstruct the phylogenetic relationships between languages and gain insights into language evolution and history.
Methodology
For this analysis, a dataset containing information on 11 languages and 30 phonetic features was used (Dataset, 2023). Each feature records how many phonemes with a given phonetic property a language has, coded as 0 (none), 1 (some), or 2 (many). To cluster the languages by phonetic similarity, two distance matrices were constructed: one using Euclidean distance and one using Dice distance, which the assignment refers to as the Dollo model. Euclidean distance measures the magnitude of the differences between feature values, so two languages that both lack a feature count as agreeing on it. Dice distance is instead computed from the features two languages share relative to the features each of them has, and it ignores features that both lack (Bergsland & Vogt, 1962).
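The two distance measures can be illustrated with a minimal Python sketch. The three languages and their feature values below are hypothetical and are not taken from the assignment dataset. The sketch also assumes that Dice distance is computed on presence or absence, with any value above 0 counting as present; the assignment does not state how the 0/1/2 coding should be binarized, so this choice is only one possibility.

```python
import math

# Hypothetical toy data: three made-up languages, six features coded 0/1/2.
# These values are illustrative only, not the assignment dataset.
features = {
    "LangA": [0, 1, 2, 0, 2, 1],
    "LangB": [0, 1, 2, 0, 1, 1],
    "LangC": [2, 0, 0, 1, 0, 0],
}

def euclidean(u, v):
    """Square root of the summed squared differences between feature values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dice(u, v):
    """Dice distance on presence/absence.

    Assumption: a feature counts as present when its value is above 0.
    Features absent in both languages do not enter the calculation.
    """
    present_u = [a > 0 for a in u]
    present_v = [b > 0 for b in v]
    shared = sum(1 for a, b in zip(present_u, present_v) if a and b)
    total = sum(present_u) + sum(present_v)
    if total == 0:
        return 0.0  # neither language has any feature; treat them as identical
    return 1.0 - 2.0 * shared / total

def distance_matrix(data, metric):
    """Return the language names and the full pairwise distance matrix."""
    names = list(data)
    return names, [[metric(data[a], data[b]) for b in names] for a in names]

names, euclid_matrix = distance_matrix(features, euclidean)
_, dice_matrix = distance_matrix(features, dice)
```

In this toy example LangA and LangB differ only in how strongly one feature is represented, so their Euclidean distance is 1 while their Dice distance is 0. This is the kind of divergence between the two measures that can lead to different trees.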
UPGMA Clustering
The UPGMA (Unweighted Pair Group Method with Arithmetic mean) algorithm was applied to both distance matrices to generate clustering trees (Sokal & Michener, 1958). UPGMA progressively joins language clusters based on average distances between all language pairs in the clusters. The Euclidean UPGMA tree grouped together the Indo-Aryan languages of Hindi, Urdu, and Punjabi in one cluster. Another cluster contained the Dravidian languages of Tamil and Malayalam. The Semitic languages of Arabic and Hebrew were also clustered together. However, the Dice UPGMA tree showed some differences, with Tamil and Malayalam splitting into two clusters instead of grouping together (Figure 1).
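The joining procedure can be sketched in plain Python as follows. This is a minimal illustration of average-linkage clustering under stated assumptions, not the code used for the analysis. The function name `upgma`, the nested-tuple tree format, and the four-taxon toy matrix are all hypothetical.

```python
def upgma(names, dist):
    """Build a tree by repeatedly merging the two closest clusters.

    names: list of labels; dist: symmetric matrix as a list of lists.
    Each merge is stored as (left, right, height), where height is half
    the average distance between the two merged clusters.
    """
    # Active clusters: id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(names)
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        (i, j), dij = min(d.items(), key=lambda item: item[1])
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster is the
        # size-weighted mean, i.e. the average over all leaf pairs.
        new_d = {}
        for k in clusters:
            dik = d[(min(i, k), max(i, k))]
            djk = d[(min(j, k), max(j, k))]
            new_d[(k, next_id)] = (n_i * dik + n_j * djk) / (n_i + n_j)
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j, dij / 2), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree

# Hypothetical toy matrix for four taxa A, B, C, D.
toy_names = ["A", "B", "C", "D"]
toy_dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
toy_tree = upgma(toy_names, toy_dist)
```

For the toy matrix, A and B are merged first, C joins them next, and D attaches last. Because every merge height is half of an average distance, all leaves end up equidistant from the root, which reflects the constant-rate assumption built into UPGMA.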
Neighbor-Joining Analysis

A Neighbor-Joining tree was also constructed from the Dice distance matrix (Saitou & Nei, 1987). Neighbor-Joining differs from UPGMA in that it minimizes the total branch length at each step when joining language clusters. The resulting tree had a similar overall structure to the Dice UPGMA tree but with some rearrangements of internal branches (Figure 2). For example, Malayalam and Tamil were joined in a cluster separate from other languages in the Neighbor-Joining tree.
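A minimal Python sketch of the Neighbor-Joining steps is given below, assuming the standard Q criterion of Saitou and Nei. It is an illustration rather than the code used for the analysis. The function name `neighbor_joining`, the `node0`, `node1` labels for internal nodes, and the five-taxon toy matrix are hypothetical.

```python
def neighbor_joining(names, dist):
    """Return the sequence of joins and the final connecting edge.

    Each join is (child_a, child_b, new_node, length_a, length_b).
    The final edge is (node_x, node_y, length).
    """
    nodes = list(names)
    d = {a: {b: dist[i][j] for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    joins = []
    counter = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other active nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        # Q criterion: the pair with the lowest value is joined next.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        # Branch lengths from the joined pair to their new parent node.
        length_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        length_b = d[a][b] - length_a
        new = f"node{counter}"
        counter += 1
        # Distances from the new node to every remaining node.
        d[new] = {new: 0.0}
        for c in nodes:
            if c not in (a, b):
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, b, new, length_a, length_b))
    x, y = nodes
    return joins, (x, y, d[x][y])

# Hypothetical toy matrix for five taxa.
nj_names = ["a", "b", "c", "d", "e"]
nj_dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
nj_joins, nj_last_edge = neighbor_joining(nj_names, nj_dist)
```

Unlike the UPGMA heights, the two branch lengths returned for each join are generally unequal, so the resulting tree is unrooted and does not force all leaves to the same depth.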
Bayesian Phylogenetic Analysis
To further investigate relationships between the 11 languages plus an unknown language, a Bayesian phylogenetic analysis was conducted in RevBayes (Höhna et al., 2016). An actual-time calibrated model was specified with a uniform prior on the root age between 3.4 and 10 thousand years (kyr), based on estimates for the divergence of major language families (Dunn et al., 2005). Sanskrit was included as a fossil calibration with a uniform age uncertainty between 2.7 and 3.4 kyr, representing the earliest attested form of Indo-Aryan (Witzel, 2005). The model incorporated gamma-distributed rate variation across sites, variable branch rates, and estimation of root state frequencies (Lewis, 2001).
The maximum clade credibility tree from this analysis grouped Sanskrit and Hindi together, consistent with their historical relationship (Figure 3). It also placed the unknown language in a cluster with the Indo-Aryan languages, suggesting it is likely another Indo-Aryan language. The tree topology was generally congruent with the Dice distance-based trees, validating the clustering results from the distance matrix analyses.
Discussion
Clustering the 11 languages based on their phonetic traits using distance matrices and tree-building methods yielded coherent and interpretable groupings with linguistic and historical validity. The Dice distance metric, which disregards features absent from both languages, produced trees more consistent with known subgroupings than Euclidean distance. Neighbor-Joining and Bayesian phylogenetic analyses corroborated the major clusters identified by UPGMA on Dice distances.
Some differences between the UPGMA and Neighbor-Joining trees likely stem from their distinct algorithms: UPGMA merges the two clusters with the smallest average distance and implicitly assumes a constant rate of change across all lineages, whereas Neighbor-Joining joins the pair that minimizes total branch length and allows branch lengths to differ between lineages. The Bayesian tree provided a time-calibrated evolutionary framework and fossil calibration to further resolve relationships. Overall, this dataset of phonetic traits proved useful for clustering languages based on phonological similarity, though a more extensive feature set may better discriminate between closely related languages. The unknown language was robustly placed within the Indo-Aryan group.
In summary, computational methods for clustering languages based on phonetic traits can reveal phylogenetic patterns concordant with historical linguistics. Integrating distance-based clustering with model-based phylogenetic analysis strengthens inferences about language relationships and evolution. Continued development of quantitative approaches will enhance understanding of language change and diversification over time.
Works Cited
Bergsland, K., & Vogt, H. (1962). On the validity of glottochronology. Current Anthropology, 3(2), 115-153.
Dataset. (2023). Assignment dataset [Data file]. Retrieved from Canvas.
Dunn, M., Greenhill, S. J., Levinson, S. C., & Gray, R. D. (2011). Evolved structure of language shows lineage-specific trends in word-order universals. Nature, 473(7345), 79–82.
Höhna, S., Landis, M. J., Heath, T. A., Boussau, B., Lartillot, N., Moore, B. R., … & Huelsenbeck, J. P. (2016). RevBayes: Bayesian phylogenetic inference using graphical models and an interactive model-specification language. Systematic Biology, 65(4), 726-736.
Lewis, P. O. (2001). A likelihood approach to estimating phylogeny from discrete morphological character data. Systematic Biology, 50(6), 913-925.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4), 406-425.
Sokal, R. R., & Michener, C. D. (1958). A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Witzel, M. (2005). The dates of the Vedic texts and the Rigveda. In G. Elder, G. Bronkhorst & W. W. Meijer (Eds.), The Study of Hinduism (pp. 341-379). University of South Carolina Press.
Assignment 1 CSCE 5290, Fall 2023
Dr. Frederik Hartmann
Terms and Conditions
This is assignment 1 for the course NLP (CSCE 5290, Fall 2023); the deadline for the
assignment is 10/1/23, 11.59pm. The conditions for this assignment are as follows:
• The assignment has to contain at least 1,000 words for Master’s students and
500 words for Bachelor’s students, no abstracts and no bibliography are allowed.
Assignment title, contents of tables, figure captions, model code, and section
headers do not count towards the word count.
• Two file types need to be submitted to Canvas: the assignment itself and the
code file(s) which you used in the analysis. Do not include Python code in your
main assignment text.
• Any methods used in the assignment may only be Bayesian models written in
STAN and visualizations thereof. Plotting software such as Tracer, FigTree, or
IcyTree is allowed.
• The programming needs to be done in Python and RevBayes.
• The code needs to be reproducible, i.e. it needs to run and produce the same
output as reported in your documentation.
• Answer and discuss the assignment questions (see below) in prose directly, no
bullet-point answers are allowed.
Assignment
This assignment consists of two parts. Master’s students have to complete both parts
to receive full points, while Bachelor’s students have to complete part 1.
Assignment part 1
On Canvas, you will find a dataset called assignment df CS.csv which contains
11 languages and 30 characters. This dataset comes from a larger phonological classification dataset that classifies languages by how many of certain phonetic
features they have. Every character is a different phonetic feature and ‘0’ in the dataset
means that the language has no phonemes with features of that type, ‘1’ means the
language has some phonemes with features of that type, and ‘2’ means that the language has many phonemes with features of that type. The goal of this assignment is
to cluster the languages by similarity in their phonetic traits to understand better which
languages are phonetically closer.
First, construct two distance matrices of the languages in the dataset, one with
euclidean distance and the other with the Dollo model (dice distance). Afterwards,
apply the UPGMA algorithm to both and plot the results. Interpret and describe the
result in prose. Compare the results that both distance methods yield and discuss
briefly why they might yield different or similar results by referencing the differences
between the distance algorithms.
Next, construct a NeighborJoining tree or network from the dice distance matrix
and plot the results. Here too, interpret the results and compare them with the UPGMA results for the same distance matrix by referencing the difference
between UPGMA and NeighborJoining algorithms.
Assignment part 2
On Canvas, you will find a nexus-format dataset called assignmentdata CS.nex which
contains a dataset with 11 languages and an unknown language. The dataset is the
same as above, but the coding here is 1,2,3 (instead of 0,1,2).
Write a RevBayes phylogenetic model that fulfills the following criteria:
1. Is an actual-time calibrated phylogenetic model that infers the root age to better
cluster the languages (with a uniform prior on age between 3.4 and 10)
2. Has Sanskrit as a fossil with a uniform time uncertainty interval between 2.7 and
3.4
3. Assumes that Sanskrit is the predecessor of Hindi
4. Includes:
(a) gamma-distributed site-rate variation
(b) variable branch rates
(c) root frequency estimation
The Q matrix has to be constructed differently from binary datasets since we have
three characters per site. Do this with this code:
    er_prior <- v(1,1,1)
    er ~ dnDirichlet(er_prior)
    moves.append( mvBetaSimplex(er, weight=3) )
    moves.append( mvDirichletSimplex(er, weight=1) )
    pi_prior <- v(1,1,1)
    pi ~ dnDirichlet(pi_prior)
    moves.append( mvBetaSimplex(pi, weight=2) )
    moves.append( mvDirichletSimplex(pi, weight=1) )
    Q := fnGTR(er,pi)
In the text, describe the model and justify your modelling and prior choices. In a last step, plot the MCC or consensus tree and interpret the tree topology by comparing it to the UPGMA tree obtained from the dice distance matrix in Assignment part 1 (above). Discuss the differences and similarities of all three trees and briefly discuss why these differences/similarities might arise by referencing the different methods with which the trees were constructed.
Further discuss the following questions briefly: (1) What can we say about how well this dataset helps us understand phonological similarity between these languages (i.e., is it useful for this question)? (2) How does the Unknown language cluster in the tree? (3) Which language(s) is it closest to?
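To make the role of er and pi in the Q matrix more concrete, the following Python sketch builds a three-state rate matrix using the textbook GTR construction: each off-diagonal rate is an exchangeability multiplied by the stationary frequency of the target state. This is an illustration only and does not reproduce RevBayes internals. The function name `gtr_rate_matrix`, the pair ordering (0,1), (0,2), (1,2), and the rescaling to an average rate of 1 are assumptions made for this sketch.

```python
def gtr_rate_matrix(er, pi):
    """Three-state GTR-style rate matrix.

    er: exchangeabilities for the state pairs (0,1), (0,2), (1,2)
        (this ordering is an assumption of the sketch).
    pi: stationary frequencies of the three states, summing to 1.
    """
    pairs = [(0, 1), (0, 2), (1, 2)]
    k = len(pi)
    q = [[0.0] * k for _ in range(k)]
    # Off-diagonal rate i -> j is the exchangeability times the frequency of j.
    for rate, (i, j) in zip(er, pairs):
        q[i][j] = rate * pi[j]
        q[j][i] = rate * pi[i]
    # Each diagonal entry makes its row sum to zero.
    for i in range(k):
        q[i][i] = -sum(q[i][j] for j in range(k) if j != i)
    # Rescale so that the expected number of changes per unit time is 1
    # (assumed normalization for this sketch).
    mean_rate = -sum(pi[i] * q[i][i] for i in range(k))
    return [[value / mean_rate for value in row] for row in q]

# Hypothetical values; in the model above both vectors are drawn from
# flat Dirichlet priors rather than fixed.
example_er = [0.2, 0.3, 0.5]
example_pi = [0.5, 0.3, 0.2]
example_q = gtr_rate_matrix(example_er, example_pi)
```

With three states there are three exchangeabilities and three frequencies, which is why both priors in the assignment code are Dirichlet distributions over three-element vectors.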
