Complexity characteristics of punctuation usage patterns in written language

Natural language has a number of features that allow it to be treated as a complex system. It has a complicated, hierarchical organization, and the properties and interactions that appear in its structures cannot always be deduced directly from the properties of the elements from which those structures are built. This thesis studies those aspects of natural language organization that can be usefully described with the formalism commonly applied to complex systems. The study focuses on written language, sampled as literary texts in several European languages (English, German, French, Italian, Spanish, Polish and Russian). The first part of the analysis discusses power-law distributions describing word frequencies in texts and investigates how the shapes of these distributions change when the frequencies of punctuation marks are taken into account. The next part represents texts as time series constructed from the partition of a text into sentences or into segments between consecutive punctuation marks. These series turn out to have features often found in signals generated by complex systems: long-range correlations and the associated fractal or multifractal structures. Moreover, the partition of texts into segments delimited by punctuation shows that the distances between consecutive punctuation marks can be described by the discrete Weibull distribution; this is a statistical regularity that holds for texts in all the languages studied in the thesis. The last part of the dissertation is devoted to linguistic networks (complex networks representing selected aspects of language organization) and focuses on word-adjacency networks, whose structure reflects the co-occurrence of words in texts.
The results of the word-adjacency network analysis indicate that the quantities characterizing such networks can be used to classify texts, for example in authorship attribution. In addition, methods of network analysis have been applied to linguistic networks of a different type, namely the so-called word-association networks, which are constructed from data collected in psycholinguistic experiments. Complex structures have been identified in these networks that differ significantly from those observed in random networks. In all of the presented analyses of written language, the key issue is punctuation and its impact on the measurable traits of language. In the analysis of word frequency distributions, punctuation marks are treated as words, which brings the observed distributions into better agreement with the power-law distribution described by Zipf's law. Including punctuation in word-adjacency networks provides useful information that significantly improves the effectiveness of identifying the features that distinguish one text from another. The results of the time series analysis show that the organization that punctuation introduces into language has both largely universal properties (common to different texts) and certain features characteristic of particular texts, for example texts in a specific language.
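The abstract's statement about distances between consecutive punctuation marks can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the distance is measured as the number of words between two marks, the set of marks in `PUNCTUATION` is purely illustrative, and the discrete Weibull distribution is written in one common parameterization, P(k) = (1 - p)^((k-1)^β) - (1 - p)^(k^β) for k = 1, 2, .... The function names `punctuation_intervals` and `discrete_weibull_pmf` are hypothetical and are not taken from the thesis.

```python
import re

# Assumption: an illustrative set of marks; the abstract does not list
# which punctuation marks the thesis actually counts.
PUNCTUATION = set(".,;:!?")


def punctuation_intervals(text):
    """Return the word counts of segments ending at a punctuation mark.

    Segments with zero words (two marks in a row) are skipped, and a
    trailing segment with no closing mark is ignored.
    """
    # Split into word tokens and single non-word, non-space characters.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    intervals = []
    count = 0
    for tok in tokens:
        if tok in PUNCTUATION:
            if count > 0:
                intervals.append(count)
            count = 0
        elif re.match(r"\w", tok):
            count += 1
    return intervals


def discrete_weibull_pmf(k, p, beta):
    """Discrete Weibull probability of the value k (k = 1, 2, ...).

    Assumed parameterization:
        P(k) = (1 - p)**((k - 1)**beta) - (1 - p)**(k**beta)
    with 0 < p < 1 and beta > 0. For beta = 1 it reduces to the
    geometric distribution.
    """
    if k < 1:
        return 0.0
    q = 1.0 - p
    return q ** ((k - 1) ** beta) - q ** (k ** beta)


if __name__ == "__main__":
    sample = "One two, three. Four five six!"
    print(punctuation_intervals(sample))
    print([round(discrete_weibull_pmf(k, 0.3, 1.5), 4) for k in range(1, 6)])
```

In this sketch, the list returned by `punctuation_intervals` is the kind of series to which a distribution such as `discrete_weibull_pmf` could be fitted; the fitting procedure itself is left out because the abstract does not describe it.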

URI

http://rifj.ifj.edu.pl/handle/item/380

License Type

Creative Commons Attribution 4.0 International (CC BY 4.0)

Collections

Rozprawy doktorskie IFJ PAN (Doctoral dissertations of IFJ PAN)
