Sylli can also be used as a Python module (see the API Documentation). For example, Sylli was used to divide a corpus of Italian into syllables. Note that you’ll need a corpus reader to access the data.
In this example, a corpus reader integrated with NLTK gives access to the data in the corpus. First, import the two modules: one for the corpus reader and one for Sylli.
>>> import sylli
>>> from nltk.corpus import clips
>>>
Now the corpus can be queried and the output syllabified. First, we define an object item which contains the id of a corpus unit, in this case the dialogue at index 5 of a clips sub-corpus (DG).
>>> item = clips.utteranceids('DG')[5]
>>>
Then we create a SylModule object and call its syllabify() method on the input string.
>>> item = clips.utteranceids('DG')[5]
>>> syl = sylli.SylModule()
>>> syl.syllabify(''.join(clips.phonemes(item)))
['ak.kan.to.a.si.nis.tra']
>>>
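The result shown above is a list holding a single string in which dots mark the syllable boundaries. A minimal sketch of post-processing such a result follows; it assumes the return value always has this shape, which the example suggests but the text does not state, and it works on a literal copy of the output so that it runs without Sylli or the corpus.

```python
# Result copied from the example above: a list holding one dot-delimited string.
# Assumption: syllabify() always returns this shape.
result = ['ak.kan.to.a.si.nis.tra']

# Split on the syllable boundary marker to get the individual syllables.
syllables = result[0].split('.')

print(syllables)
print(len(syllables))
```

Splitting on the dot gives one list entry per syllable, which is convenient for counting syllables or computing per-syllable statistics.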
You can also syllabify each word separately.
>>> item = clips.utteranceids('DG')[5]
>>> syl = sylli.SylModule()
>>> for word in clips.phonemes(item):
...     print syl.syllabify(word)
...
['ak.kan.to']
['a']
['si.ni.stra']
>>>
Or syllabify a single word.
>>> item = clips.utteranceids('DG')[5]
>>> syl = sylli.SylModule()
>>> syl.syllabify(clips.phonemes(item)[0])
['a.kk"an.to']
>>>
You can also load another configuration file.
>>> syl = sylli.SylModule()
>>> syl.load_conf('/home/jako/sonority.txt')
>>>
Or specify the configuration using the object’s attributes.
>>> syl = sylli.SylModule()
>>> syl.sonority_file = '/home/jako/sonority.txt'
>>> syl.output = 'cvcv'
>>> syl.extra = 0
>>> syl.syllabify('strada')
CCCV.CV
>>>
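To make the 'cvcv' output above easier to read, here is a rough sketch of how a dot-separated syllabification maps to a consonant/vowel skeleton. The helper cv_skeleton and the VOWELS set are hypothetical and are not part of Sylli; the sketch only assumes that vowels are the letters a, e, i, o, u in either case and that non-letter symbols such as stress marks can be ignored.

```python
# Hypothetical helper, NOT part of Sylli: illustrates what a 'cvcv'-style
# skeleton looks like for an already syllabified string.
VOWELS = set("aeiouAEIOU")  # assumption: these letters are the only vowels


def cv_skeleton(syllabified):
    """Map a dot-separated syllabification to a C/V skeleton."""
    out = []
    for ch in syllabified:
        if ch == ".":
            out.append(".")      # keep syllable boundaries
        elif ch in VOWELS:
            out.append("V")
        elif ch.isalpha():
            out.append("C")      # any other letter counts as a consonant
        # non-letters (e.g. a stress mark) are skipped
    return "".join(out)


print(cv_skeleton("stra.da"))
```

For the input 'stra.da' this yields the same skeleton as the example above, CCCV.CV.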
Finally, the TIMIT can be displayed, along with any other available information, in whatever layout you need. For example, the following code prints each sentence with its TIMIT, then the phonological transcription of each word with its TIMIT and its syllabification, using the corpus reader.
# import the splt corpus reader
from nltk.corpus import splt
import sylli
syl = sylli.SylModule()
# ids of the first ten utterances in the corpus
item = splt.utteranceids()[0:10]
# for every sentence
for it in item:
    print it + ":"
    # print the sentence with TIMIT indicators
    print splt.word_times(it)
    # for every word in the sentence, print the phonemic form with its
    # TIMIT, followed by its syllabification
    for word, phone in zip(splt.word_times(it), splt.phoneme_times(it)):
        print phone, '>', syl.syllabify(phone[0])
Output:
PALERMO/PALERMO/corpusPa/DGmtB03P_p1F#119:
[(u'__%', 0, 1549), (u'per', 1549, 6157), (u'tutto', 6157, 11384), (u'il', 11384, 12275), (u'foglio', 12275, 22420)]
('__%', 0, 1549) >
('per', 1549, 6157) > per
('t"utto', 6157, 11384) > tut.to
('il', 11384, 12275) > il
('f"OLLo', 12275, 22420) > fOL.Lo
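Each entry in the output above is a (label, start, end) tuple, so the length of a word follows from a subtraction. The sketch below works on a literal copy of the word list printed above, so that it runs without the corpus; the helper durations is hypothetical, and the unit of the time marks is not stated in this document, so the values are left in whatever unit the corpus uses.

```python
# Word-level time marks copied from the output above: (label, start, end).
word_times = [
    (u'__%', 0, 1549),
    (u'per', 1549, 6157),
    (u'tutto', 6157, 11384),
    (u'il', 11384, 12275),
    (u'foglio', 12275, 22420),
]


def durations(times):
    """Hypothetical helper: return (label, end - start) for each entry."""
    return [(label, end - start) for label, start, end in times]


for label, length in durations(word_times):
    print("%s: %d" % (label, length))
```

The same helper applies unchanged to the phoneme-level tuples, since they follow the same three-field layout.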