Date: Thu, 24 Jun 1999 13:08:35 +0200
From: Vaclav Hanzl
To: j.wells@ucl.ac.uk
Cc: pollak@noel.feld.cvut.cz, cernocky@urel.fee.vutbr.cz, utrrrusk@savba.sk, xbatusek@fi.muni.cz
Subject: Czech SAMPA

Dear professor Wells,

our speech recognition group at the Czech Technical University in Prague is currently creating the Czech Speechdat database. For this purpose we wanted to design a Czech SAMPA. However, looking at your directory index page

http://www.phon.ucl.ac.uk/home/sampa

we discovered that there already is a Czech SAMPA page, though it is not yet linked from your main SAMPA page. We are glad to see that the work is nearly done on the initiative of Robert Bartusek; however, we have found that the current proposal neglects some less known but important facts about the language.

Basically, the problem is that the current proposal does not meet the requirement stated on your main page:

"A SAMPA transcription is designed to be uniquely parsable. As with the ordinary IPA, a string of SAMPA symbols does not require spaces between successive symbols."

The most obvious problems arise from /ts/, /tS/ and /ou/, which would be indistinguishable from the sequences /t s/, /t S/ and /o u/. The difference is often important in Czech and cannot be neglected. We also propose to change /au/, /eu/, /dz/, /tS/ and /dZ/. Our detailed reasoning follows.

As you propose in your article [Computer-coding the IPA: a proposed extension of SAMPA], such problems can be solved using the conjunctor (underscore _) and the disjunctor (hyphen -).
After careful study of the problem, we propose a very specific use of these two marks in the transcription of Czech:

1) conjunctor (underscore _) indicates an affricate or a diphthong
2) disjunctor (hyphen -) indicates a sequence of two units
3) no conjunctor and no disjunctor indicates INDETERMINACY between 1) and 2)

The proposed expression of indeterminacy really makes sense for Czech and can be used in several ways (depending on the needs of the SAMPA user):

a) to express two orthoepic variants of pronunciation
b) to express language reality (a variant is incorrect but often used)
c) to express lack of the knowledge needed for a more exact transcription (especially when a computer program does the transcription)

Though it might now seem that this gets too complicated and that the original proposal using just /ts/ and /tS/ would often be good enough, I would like to stress that this is not the case. These affricates have their own graphical representation in Czech ('c' and 'c^'), and when written like this they are never pronounced as the sequences /t s/ or /t S/. Undoubtedly they are separate phonemes and deserve their own symbols. Unfortunately, the most tempting SAMPA transcriptions /c/ and /C/ are not free, so we are forced to accept a more awkward alternative. Being forced to use a complicated form like /t_s/ anyway, we get the possibility of expressing indeterminacy as a free bonus. I hope all this will be clarified by the examples below.

We also propose an explicit listing of several allophones to unify the choice of their SAMPA transcription (though using these allophones in transcription should be optional), and other minor changes.

In the following text we will represent Czech orthography like this:

^ after u                 indicates ring accent (krouzek)
^ after other characters  indicates wedge accent (caron, hacek)
/ after character         indicates acute accent (carka)

To make things clear, we will separate SAMPA transcriptions of phones by spaces, though with our specification spaces are no longer necessary.
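The claim that spaces are no longer necessary can be checked mechanically. The following is only a minimal Python sketch, not part of any existing SAMPA tool: it assumes the symbol inventory listed in the summary at the end of this letter, and the names SYMBOLS, AMBIGUOUS, tokenize and indeterminate_pairs are hypothetical, chosen for illustration. It uses longest-match tokenization, so /t_s/ is read as one affricate, /t-s/ as a marked sequence, and unmarked /ts/ is reported as indeterminate.

```python
import re

# Symbol inventory copied from the summary of this proposal
# (optional syllabic marks and allophones included).
SYMBOLS = [
    "i", "e", "a", "o", "u", "i:", "e:", "a:", "o:", "u:",
    "a_u", "e_u", "o_u",
    "p", "b", "t", "d", "t'", "d'", "k", "g",
    "t_s", "d_z", "t_S", "d_Z",
    "f", "v", "s", "z", "r'", "S", "Z", "j", "x", "h\\",
    "r", "l", "m", "n", "n'",
    "l=", "m=", "r=", "F", "N", "G", "r'_0", "r'_v",
]

# Longest alternatives first, so "t_s" wins over "t" and "r'_0" over "r'".
# The disjunctor "-" is kept as a token of its own.
_TOKEN = re.compile(
    "|".join(re.escape(s) for s in sorted(SYMBOLS, key=len, reverse=True)) + r"|-"
)

# Pairs that are indeterminate when written with neither "_" nor "-".
AMBIGUOUS = {("t", "s"), ("t", "S"), ("d", "z"), ("d", "Z"),
             ("a", "u"), ("e", "u"), ("o", "u")}


def tokenize(transcription):
    """Split a transcription (with or without spaces) into symbols."""
    text = transcription.replace(" ", "")
    tokens, pos = [], 0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unknown symbol at position {pos} in {text!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


def indeterminate_pairs(tokens):
    """Return the unmarked pairs (rule 3) found in a token list."""
    return [a + b for a, b in zip(tokens, tokens[1:]) if (a, b) in AMBIGUOUS]


print(tokenize("d'et_ski:"))   # ["d'", 'e', 't_s', 'k', 'i:']
print(tokenize("d'et-ski:"))   # ["d'", 'e', 't', '-', 's', 'k', 'i:']
print(indeterminate_pairs(tokenize("d'etski:")))  # ['ts']
```

A symbol outside the inventory (for example a bare /c/) raises ValueError, which is one simple way for a program to reject transcriptions that do not follow this proposal.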
SAMPA transcriptions are delimited by slashes (/.../) when shown in text.


DETAILED EXPLANATION WITH EXAMPLES
==================================

Affricate /t_s/
---------------

It should be distinguished from the sequence /t - s/. Though it is true that the sequence /t - s/ might sometimes assimilate to the affricate /t_s/, it is still different from it, and this difference can be the only one which carries the meaning. For example:

ORTHOGRAPHY   SAMPA                 MEANING
klacku        k l a t_s k u         a bludgeon (genitive, singular)
Kladsku       k l a t - s k u       region Kladsko (local case, singular)
ra/ce         r a: t_s e            a breed
ra/d se       r a: t - s e          he likes
ve^cem        v j e t_s e m         a thing (dative, plural)
vjet sem      v j e t - s e m       to drive here inside
r^ic^i/       r' i t_S i:           he yells
r^its^i/      r' i t - S i:         more sparse

However, /t - s/ might also often assimilate to /t_s/ without problems:

de^tsky/      d' e t_s k i:         childish
          OR  d' e t - s k i:

Both forms are correct, and even a single speaker might use both. We propose the transcription /d'etski:/ to express this indeterminacy (while /d'et_ski:/ and /d'et-ski:/ are still possible and describe the particular variant used).

Affricate /t_S/
---------------

The situation is very similar to the one above. For example:

poc^i/t       p o t_S i: t          to begin
pods^i/t      p o t - S i: t        to line (to sew)
r^ic^i/       r' i t_S i:           he yells
r^its^i/      r' i t - S i:         more sparse
ve^ts^i/      v j e t S i:          bigger

Diphthong /o_u/
---------------

This is a true phoneme, always acknowledged as such without discussion. It is distinct from the sequence /o - u/: it has a different pronunciation, and all speakers and listeners do distinguish the two. The difference can be demonstrated on pairs like this one:

samouk        s a m o - u k         autodidact
pavouk        p a v o_u k           spider

When these are pronounced wrongly (/s a m o_u k/, /p a v o - u k/), they sound really strange and people often do not understand the word.
There is even a minimal pair

proudit       p r o_u d i t         to flow
proudit       p r o - u d i t       to bake well in smoke

where the graphical form is the same and the pronunciation selects the meaning.

Another way to demonstrate the difference is to ask people how many syllables a certain word has. /o_u/ always makes for one syllable, while /o - u/ makes for two.

Czech speakers never use /o_u/ instead of /o - u/ or vice versa. However, a computer program might not be able to make the right choice, in which case the indeterminacy notation /ou/ might be used.

Diphthong /a_u/
---------------

The situation is rather similar to that of /o_u/. Both /a_u/ and /a - u/ occur in Czech:

auto          a_u t o               a car
nauc^it       n a - u t_S i t       teach

/a_u/ is, however, less frequent than /o_u/ and thereby possibly prone to the same problems as described below for /eu/.

Diphthong /e_u/, sequence /e - u/
---------------------------------

Both /e_u/ and /e - u/ occur in the language (though they are rare), in the same way as described above for /o_u/, /o - u/ and /a_u/, /a - u/. However, the pronunciation is very uncertain. There are Czech composite words:

neusta/le     n e - u s t a: l e    all the time
neutucha/     n e - u t u x a:      does not stop

where the composition seam should prevent the diphthong, and there are words of foreign origin:

neuron        n e_u r o n           neuron
neutron       n e - u t r o n       neutron
pneumatika    p n e u m a t i k a   tyre

where the correct pronunciation depends on the word's origin, which is generally unknown to the speaker, or even on whether the composition seam was created far enough back in the word's history. For some words the right pronunciation is /e_u/, for others /e - u/, and for the rest both. In reality, people use both in nearly all cases (the syllable counting test shows this clearly). In most cases, the indeterminacy transcription /eu/ might be the best one.
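The syllable counting test and the indeterminacy notation can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: transcriptions are given space-separated, as in the examples of this letter; syllabic consonants are counted only when marked with /=/; and the names VOWELS, AMBIGUOUS, syllable_count and variants are hypothetical.

```python
from itertools import product

# Vowel and diphthong symbols of the proposal; each forms one syllable nucleus.
VOWELS = {"i", "e", "a", "o", "u", "i:", "e:", "a:", "o:", "u:",
          "a_u", "e_u", "o_u"}

# Pairs that are indeterminate when written with neither "_" nor "-".
AMBIGUOUS = {("t", "s"), ("t", "S"), ("d", "z"), ("d", "Z"),
             ("a", "u"), ("e", "u"), ("o", "u")}


def syllable_count(tokens):
    """Count nuclei: vowels, diphthongs and consonants marked syllabic (=)."""
    return sum(1 for t in tokens if t in VOWELS or t.endswith("="))


def variants(tokens):
    """Expand every unmarked pair into its "_" reading and its "-" reading."""
    # No symbol can end one ambiguous pair and start another,
    # so the positions collected here never overlap.
    spots = [i for i in range(len(tokens) - 1)
             if (tokens[i], tokens[i + 1]) in AMBIGUOUS]
    results = []
    for choice in product("_-", repeat=len(spots)):
        marks = dict(zip(spots, choice))
        out, i = [], 0
        while i < len(tokens):
            if marks.get(i) == "_":
                out.append(tokens[i] + "_" + tokens[i + 1])
                i += 2
            elif marks.get(i) == "-":
                out.extend([tokens[i], "-"])
                i += 1
            else:
                out.append(tokens[i])
                i += 1
        results.append(out)
    return results


print(syllable_count("p a v o_u k".split()))    # 2
print(syllable_count("s a m o - u k".split()))  # 3
for reading in variants("p n e u m a t i k a".split()):
    print(" ".join(reading), syllable_count(reading))
# p n e_u m a t i k a 4
# p n e - u m a t i k a 5
```

A transcription without any unmarked pair comes back from variants unchanged, as a single reading.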
However, in some words the diphthong is indisputable, for example:

euforie       e_u f o r i e         euphoria

Affricates /d_z/, /d_Z/
-----------------------

These are rather similar to /t_s/ and /t_S/ (which are their voiceless counterparts). However, /d_z/ and /d_Z/ are much less frequent and do not even have their own Czech graphemes. Despite the lack of distinct graphemes, the affricate pronunciation might be the only possible one:

leckdy        l e d_z g d i         at times

(here the voiced /d/ caused the assimilation /k/ -> /g/ and in turn /t_s/ -> /d_z/; one grapheme 'c' corresponds to /d_z/, which cannot be split into /d - z/)

dz^i/ny       d_Z i: n i            jeans

(only a tired speaker thinking too much about affricates might ever say /d - Z i: n i/; under normal circumstances it is impossible)

At a word seam, indeterminacy is possible:

nadz^ivotni/  n a d_Z i v o t n i:      above lifesize
          OR  n a d - Z i v o t n i:

and again we would transcribe it /nadZivotni:/.


ALLOPHONES TO BE ADDED TO THE LIST
==================================

We propose to add these allophones to the list (and thereby unify possible future Czech SAMPA practice when these allophones are used):

PHONE   ALLOPHONE OF   ORTHOGRAPHY   TRANSCRIPTION      MEANING
F       m              tramvaj       t r a F v a j      tram
N       n              banka         b a N k a          bank
G       x              abych byl     a b i G b i l      so as I am
r'_0    r'             ker^          k e r'_0           bush
r'_v    r'             zr^i/dka      z r'_v i: t k a    seldom


SYLLABIC CONSONANTS
===================

Syllabic consonants /l=/, /r=/ and /m=/ are nearly identical with non-syllabic /l/, /r/ and /m/. We propose to make the /=/ mark optional.


DIPHTHONGS ENDING WITH /_j/
===========================

We suppose there are no problems similar to those with /ou/, /au/ and /eu/. Diphthongs like /aj/ are completely determined by the surrounding phones; there is no word-level information which should be stored in the SAMPA transcription (as there is in our example with 'pavouk' and 'samouk').
Therefore we propose to keep the transcription without any /_/ or /-/ as the basic one; in fact the very existence of these diphthongs may be ignored - often we can take them to be sequences like /a j/ where /a/ is influenced by /j/ and /j/ by /a/. Should somebody find the difference important, we propose consistent use of /-/ and /_/ to express it.


OTHER SAMPA CHANGES (TO BE AVOIDED)
===================================

To be honest, I must confess that I know of further possible modifications of Czech SAMPA which would make it a bit more exact. They would, however, make Czech SAMPA much more awkward for a Czech user, and so WE DO NOT WISH TO MAKE THESE CHANGES:

use     instead of
--------------------------------
E       e
U       u
c       t'
J\      d'
J       n'

There are also very tempting changes which would make SAMPA less awkward for a Czech user. They would, however, decrease the phonetic precision too much, and therefore WE ALSO DO NOT WISH TO MAKE THESE CHANGES:

use     instead of
--------------------------------
h       h\
M       F
c       t_s
C       t_S
R       r'_0


********************** S U M M A R Y ****************************

To summarize our proposal, here is the complete Czech SAMPA with our changes:

Vowels:
-------

i       mys^         miS          mouse
e       les          les          forest
a       pas          pas          passport
o       rok          rok          year
u       kus          kus          piece
i:      pi/t         pi:t         to drink
e:      le/k         le:k         drug
a:      ra/d         ra:t         glad
o:      mo/da        mo:da        fashion
u:      pu^l         pu:l         half

diphthongs:

a_u     auto         a_uto        car
e_u     euforie      e_uforie     euphoria
o_u     mouka        mo_uka       flour

Sequences of two vowels are distinguished by a hyphen (-):

a-u     nauc^it      na-ut_Sit    teach

When it is indeterminate whether there is a diphthong or two vowels, neither /-/ nor /_/ is used:

eu      pneumatika   pneumatika   tyre (e_u OR e-u)

Consonants:
-----------

plosives:

p       pes          pes          dog
b       bota         bota         shoe
t       tam          tam          there
d       du/m         du:m         house
t'      tito         t'ito        these
d'      de^d         d'et         grandfather
k       krk          krk          neck
g       kde          gde          where

affricates:

t_s     ci/l         t_si:l       aim
d_z     leckdy       led_zgdi     at times
t_S     c^as         t_Sas        time
d_Z     dz^ba/n      d_Zba:n      jug

Treatment of sequences and indeterminacies is the same
as for diphthongs:

dz      podzim       podzim       autumn (d_z OR d-z)

fricatives:

f       forma        forma        form
v       vak          vak          bag
s       sen          sen          dream
z       zub          zup          tooth
r'      r^ek         r'ek         Greek
S       s^a/l        Sa:l         scarf
Z       z^al         Zal          regret
j       boj          boj          fight
x       chlapec      xlapet_s     boy
h\      had          h\at         snake

liquids:

r       ret          ret          lip
l       led          let          ice

nasals:

m       mrak         mrak         cloud
n       noc          not_s        night
n'      nic          n'it_s       nothing


OPTIONAL CZECH SAMPA NOTATION:
==============================

Marking syllabic consonants with /=/:

l=      vlk          vl=k         wolf
m=      osm          osm=         eight
r=      krk          kr=k         neck

Using some allophones:

PHONE   ALLOPHONE OF   ORTHOGRAPHY   TRANSCRIPTION   MEANING
F       m              tramvaj       traFvaj         tram
N       n              banka         baNka           bank
G       x              abych byl     abiGbil         so as I am
r'_0    r'             ker^          ker'_0          bush
r'_v    r'             zr^i/dka      zr'_vi:tka      seldom

/F/, /N/ and /G/ are used together with the basic phones (some occurrences of the basic form are replaced by the allophone), while the allophones /r'_0/ and /r'_v/ replace all occurrences of the basic phone /r'/.

*********** E N D   O F   T H E   S U M M A R Y *****************

In general, Czech SAMPA is bound to be a bit special and awkward - this is due to the IPA itself, which is not very well suited to the Czech language. Czech phoneticians usually devote one page in their books to explaining why they will not use IPA, and these pages are the only place where they use any IPA symbols. Czech orthography itself is close to a phonetic (or phonemic) transcription, and only minor changes are needed to create a suitable, complete and exact transcription system. The same holds true for computer-related work - Czech character sets are generally available here (for example ISO 8859-2), and we usually use a transcription employing just lowercase and uppercase letters (including accented ones), with just one character per phone (or phoneme). Czech SAMPA is likely to be used for compatibility reasons only, and bigger pieces of Czech SAMPA will probably be created by automatic conversion from our 'national' transcription system. Therefore we can live with an 'awkward' Czech SAMPA.
Nevertheless, we still believe our effort will help to find an acceptable solution for the Czech SAMPA transcription.

Dear professor Wells, thank you very much for your attention and for any comments you might have on this proposal. Of course, comments are also welcome from everybody who receives a copy. And please excuse my English; I know it is more an Internet pidgin than a proper language.

With best regards

Yours sincerely

Vaclav Hanzl

+-----------------------------------------------------------------------+
| Czech Technical University in Prague         fax: (+420 2) 243 10 784 |
| Faculty of Electrical Engineering, K331        or (+420 2) 311 1786   |
| Technicka 2                                                           |
| 166 27 Prague 6, Czech Republic        email: hanzl@noel.feld.cvut.cz |
| http://amber.feld.cvut.cz/user/Hanzl                                  |
+-----------------------------------------------------------------------+