<!DOCTYPE html><html><head>
<title>README</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="file:///c:\Users\acha\.vscode\extensions\shd101wyy.markdown-preview-enhanced-0.8.10\crossnote\dependencies\katex\katex.min.css">
<style>
code[class*=language-],pre[class*=language-]{color:#333;background:0 0;font-family:Consolas,"Liberation Mono",Menlo,Courier,monospace;text-align:left;white-space:pre;word-spacing:normal;word-break:normal;word-wrap:normal;line-height:1.4;-moz-tab-size:8;-o-tab-size:8;tab-size:8;-webkit-hyphens:none;-moz-hyphens:none;-ms-hyphens:none;hyphens:none}pre[class*=language-]{padding:.8em;overflow:auto;border-radius:3px;background:#f5f5f5}:not(pre)>code[class*=language-]{padding:.1em;border-radius:.3em;white-space:normal;background:#f5f5f5}.token.blockquote,.token.comment{color:#969896}.token.cdata{color:#183691}.token.doctype,.token.macro.property,.token.punctuation,.token.variable{color:#333}.token.builtin,.token.important,.token.keyword,.token.operator,.token.rule{color:#a71d5d}.token.attr-value,.token.regex,.token.string,.token.url{color:#183691}.token.atrule,.token.boolean,.token.code,.token.command,.token.constant,.token.entity,.token.number,.token.property,.token.symbol{color:#0086b3}.token.prolog,.token.selector,.token.tag{color:#63a35c}.token.attr-name,.token.class,.token.class-name,.token.function,.token.id,.token.namespace,.token.pseudo-class,.token.pseudo-element,.token.url-reference .token.variable{color:#795da3}.token.entity{cursor:help}.token.title,.token.title .token.punctuation{font-weight:700;color:#1d3e81}.token.list{color:#ed6a43}.token.inserted{background-color:#eaffea;color:#55a532}.token.deleted{background-color:#ffecec;color:#bd2c00}.token.bold{font-weight:700}.token.italic{font-style:italic}.language-json .token.property{color:#183691}.language-markup .token.tag .token.punctuation{color:#333}.language-css .token.function,code.language-css{color:#0086b3}.language-yaml .token.atrule{color:#63a35c}code.language-yaml{color:#183691}.language-ruby .token.function{color:#333}.language-markdown .token.url{color:#795da3}.language-makefile .token.symbol{color:#795da3}.language-makefile .token.variable{color:#183691}.language-makefile 
.token.builtin{color:#0086b3}.language-bash .token.keyword{color:#0086b3}pre[data-line]{position:relative;padding:1em 0 1em 3em}pre[data-line] .line-highlight-wrapper{position:absolute;top:0;left:0;background-color:transparent;display:block;width:100%}pre[data-line] .line-highlight{position:absolute;left:0;right:0;padding:inherit 0;margin-top:1em;background:hsla(24,20%,50%,.08);background:linear-gradient(to right,hsla(24,20%,50%,.1) 70%,hsla(24,20%,50%,0));pointer-events:none;line-height:inherit;white-space:pre}pre[data-line] .line-highlight:before,pre[data-line] .line-highlight[data-end]:after{content:attr(data-start);position:absolute;top:.4em;left:.6em;min-width:1em;padding:0 .5em;background-color:hsla(24,20%,50%,.4);color:#f4f1ef;font:bold 65%/1.5 sans-serif;text-align:center;vertical-align:.3em;border-radius:999px;text-shadow:none;box-shadow:0 1px #fff}pre[data-line] .line-highlight[data-end]:after{content:attr(data-end);top:auto;bottom:.4em}html body{font-family:'Helvetica Neue',Helvetica,'Segoe UI',Arial,freesans,sans-serif;font-size:16px;line-height:1.6;color:#333;background-color:#fff;overflow:initial;box-sizing:border-box;word-wrap:break-word}html body>:first-child{margin-top:0}html body h1,html body h2,html body h3,html body h4,html body h5,html body h6{line-height:1.2;margin-top:1em;margin-bottom:16px;color:#000}html body h1{font-size:2.25em;font-weight:300;padding-bottom:.3em}html body h2{font-size:1.75em;font-weight:400;padding-bottom:.3em}html body h3{font-size:1.5em;font-weight:500}html body h4{font-size:1.25em;font-weight:600}html body h5{font-size:1.1em;font-weight:600}html body h6{font-size:1em;font-weight:600}html body h1,html body h2,html body h3,html body h4,html body h5{font-weight:600}html body h5{font-size:1em}html body h6{color:#5c5c5c}html body strong{color:#000}html body del{color:#5c5c5c}html body a:not([href]){color:inherit;text-decoration:none}html body a{color:#08c;text-decoration:none}html body 
a:hover{color:#00a3f5;text-decoration:none}html body img{max-width:100%}html body>p{margin-top:0;margin-bottom:16px;word-wrap:break-word}html body>ol,html body>ul{margin-bottom:16px}html body ol,html body ul{padding-left:2em}html body ol.no-list,html body ul.no-list{padding:0;list-style-type:none}html body ol ol,html body ol ul,html body ul ol,html body ul ul{margin-top:0;margin-bottom:0}html body li{margin-bottom:0}html body li.task-list-item{list-style:none}html body li>p{margin-top:0;margin-bottom:0}html body .task-list-item-checkbox{margin:0 .2em .25em -1.8em;vertical-align:middle}html body .task-list-item-checkbox:hover{cursor:pointer}html body blockquote{margin:16px 0;font-size:inherit;padding:0 15px;color:#5c5c5c;background-color:#f0f0f0;border-left:4px solid #d6d6d6}html body blockquote>:first-child{margin-top:0}html body blockquote>:last-child{margin-bottom:0}html body hr{height:4px;margin:32px 0;background-color:#d6d6d6;border:0 none}html body table{margin:10px 0 15px 0;border-collapse:collapse;border-spacing:0;display:block;width:100%;overflow:auto;word-break:normal;word-break:keep-all}html body table th{font-weight:700;color:#000}html body table td,html body table th{border:1px solid #d6d6d6;padding:6px 13px}html body dl{padding:0}html body dl dt{padding:0;margin-top:16px;font-size:1em;font-style:italic;font-weight:700}html body dl dd{padding:0 16px;margin-bottom:16px}html body code{font-family:Menlo,Monaco,Consolas,'Courier New',monospace;font-size:.85em;color:#000;background-color:#f0f0f0;border-radius:3px;padding:.2em 0}html body code::after,html body code::before{letter-spacing:-.2em;content:'\00a0'}html body pre>code{padding:0;margin:0;word-break:normal;white-space:pre;background:0 0;border:0}html body .highlight{margin-bottom:16px}html body .highlight pre,html body pre{padding:1em;overflow:auto;line-height:1.45;border:#d6d6d6;border-radius:3px}html body .highlight pre{margin-bottom:0;word-break:normal}html body pre code,html body pre 
tt{display:inline;max-width:initial;padding:0;margin:0;overflow:initial;line-height:inherit;word-wrap:normal;background-color:transparent;border:0}html body pre code:after,html body pre code:before,html body pre tt:after,html body pre tt:before{content:normal}html body blockquote,html body dl,html body ol,html body p,html body pre,html body ul{margin-top:0;margin-bottom:16px}html body kbd{color:#000;border:1px solid #d6d6d6;border-bottom:2px solid #c7c7c7;padding:2px 4px;background-color:#f0f0f0;border-radius:3px}@media print{html body{background-color:#fff}html body h1,html body h2,html body h3,html body h4,html body h5,html body h6{color:#000;page-break-after:avoid}html body blockquote{color:#5c5c5c}html body pre{page-break-inside:avoid}html body table{display:table}html body img{display:block;max-width:100%;max-height:100%}html body code,html body pre{word-wrap:break-word;white-space:pre}}.markdown-preview{width:100%;height:100%;box-sizing:border-box}.markdown-preview ul{list-style:disc}.markdown-preview ul ul{list-style:circle}.markdown-preview ul ul ul{list-style:square}.markdown-preview ol{list-style:decimal}.markdown-preview ol ol,.markdown-preview ul ol{list-style-type:lower-roman}.markdown-preview ol ol ol,.markdown-preview ol ul ol,.markdown-preview ul ol ol,.markdown-preview ul ul ol{list-style-type:lower-alpha}.markdown-preview .newpage,.markdown-preview .pagebreak{page-break-before:always}.markdown-preview pre.line-numbers{position:relative;padding-left:3.8em;counter-reset:linenumber}.markdown-preview pre.line-numbers>code{position:relative}.markdown-preview pre.line-numbers .line-numbers-rows{position:absolute;pointer-events:none;top:1em;font-size:100%;left:0;width:3em;letter-spacing:-1px;border-right:1px solid #999;-webkit-user-select:none;-moz-user-select:none;-ms-user-select:none;user-select:none}.markdown-preview pre.line-numbers .line-numbers-rows>span{pointer-events:none;display:block;counter-increment:linenumber}.markdown-preview 
pre.line-numbers .line-numbers-rows>span:before{content:counter(linenumber);color:#999;display:block;padding-right:.8em;text-align:right}.markdown-preview .mathjax-exps .MathJax_Display{text-align:center!important}.markdown-preview:not([data-for=preview]) .code-chunk .code-chunk-btn-group{display:none}.markdown-preview:not([data-for=preview]) .code-chunk .status{display:none}.markdown-preview:not([data-for=preview]) .code-chunk .output-div{margin-bottom:16px}.markdown-preview .md-toc{padding:0}.markdown-preview .md-toc .md-toc-link-wrapper .md-toc-link{display:inline;padding:.25rem 0}.markdown-preview .md-toc .md-toc-link-wrapper .md-toc-link div,.markdown-preview .md-toc .md-toc-link-wrapper .md-toc-link p{display:inline}.markdown-preview .md-toc .md-toc-link-wrapper.highlighted .md-toc-link{font-weight:800}.scrollbar-style::-webkit-scrollbar{width:8px}.scrollbar-style::-webkit-scrollbar-track{border-radius:10px;background-color:transparent}.scrollbar-style::-webkit-scrollbar-thumb{border-radius:5px;background-color:rgba(150,150,150,.66);border:4px solid rgba(150,150,150,.66);background-clip:content-box}html body[for=html-export]:not([data-presentation-mode]){position:relative;width:100%;height:100%;top:0;left:0;margin:0;padding:0;overflow:auto}html body[for=html-export]:not([data-presentation-mode]) .markdown-preview{position:relative;top:0;min-height:100vh}@media screen and (min-width:914px){html body[for=html-export]:not([data-presentation-mode]) .markdown-preview{padding:2em calc(50% - 457px + 2em)}}@media screen and (max-width:914px){html body[for=html-export]:not([data-presentation-mode]) .markdown-preview{padding:2em}}@media screen and (max-width:450px){html body[for=html-export]:not([data-presentation-mode]) .markdown-preview{font-size:14px!important;padding:1em}}@media print{html body[for=html-export]:not([data-presentation-mode]) #sidebar-toc-btn{display:none}}html body[for=html-export]:not([data-presentation-mode]) 
#sidebar-toc-btn{position:fixed;bottom:8px;left:8px;font-size:28px;cursor:pointer;color:inherit;z-index:99;width:32px;text-align:center;opacity:.4}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] #sidebar-toc-btn{opacity:1}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc{position:fixed;top:0;left:0;width:300px;height:100%;padding:32px 0 48px 0;font-size:14px;box-shadow:0 0 4px rgba(150,150,150,.33);box-sizing:border-box;overflow:auto;background-color:inherit}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc::-webkit-scrollbar{width:8px}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc::-webkit-scrollbar-track{border-radius:10px;background-color:transparent}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc::-webkit-scrollbar-thumb{border-radius:5px;background-color:rgba(150,150,150,.66);border:4px solid rgba(150,150,150,.66);background-clip:content-box}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc a{text-decoration:none}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc .md-toc{padding:0 16px}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc .md-toc .md-toc-link-wrapper .md-toc-link{display:inline;padding:.25rem 0}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc .md-toc .md-toc-link-wrapper .md-toc-link div,html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc .md-toc .md-toc-link-wrapper .md-toc-link p{display:inline}html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .md-sidebar-toc .md-toc .md-toc-link-wrapper.highlighted .md-toc-link{font-weight:800}html 
body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .markdown-preview{left:300px;width:calc(100% - 300px);padding:2em calc(50% - 457px - 300px / 2);margin:0;box-sizing:border-box}@media screen and (max-width:1274px){html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .markdown-preview{padding:2em}}@media screen and (max-width:450px){html body[for=html-export]:not([data-presentation-mode])[html-show-sidebar-toc] .markdown-preview{width:100%}}html body[for=html-export]:not([data-presentation-mode]):not([html-show-sidebar-toc]) .markdown-preview{left:50%;transform:translateX(-50%)}html body[for=html-export]:not([data-presentation-mode]):not([html-show-sidebar-toc]) .md-sidebar-toc{display:none}
/* Please visit the URL below for more information: */
/* https://shd101wyy.github.io/markdown-preview-enhanced/#/customize-css */
</style>
<!-- The content below will be included at the end of the <head> element. --><script type="text/javascript">
document.addEventListener("DOMContentLoaded", function () {
// your code here
});
</script></head><body for="html-export">
<div class="crossnote markdown-preview ">
<p><a href="https://classroom.github.com/a/O2G_bdRs"><img src="https://classroom.github.com/assets/deadline-readme-button-24ddc0f5d75046c5622901739e7c5dd533143b0c8e959d652212380cedb1ea36.svg" alt="Review Assignment Due Date"></a></p>
<h1 id="mpcs-51042-python-programming-final-project">MPCS 51042 Python Programming Final Project </h1>
<h2 id="introduction">Introduction </h2>
<p>In this final project, you will develop a modeling system using Markov Models.<br>
You don’t have to understand them completely (or at all) for this assignment, but here are some helpful slides if you are curious:</p>
<p><a href="http://cecas.clemson.edu/~ahoover/ece854/refs/Ramos-Intro-HMM.pdf">http://cecas.clemson.edu/~ahoover/ece854/refs/Ramos-Intro-HMM.pdf</a></p>
<p>Markov models can be used to capture the statistical relationships present in a language like English. These models allow us to go beyond simplistic observations, like the frequency with which specific letters or words appear in a language, and instead to capture the relationships between words or letters in sequences. As a result, we can not only appreciate that the letter “q” appears in text at a certain rate, but also that it is virtually always followed by a “u.” Similarly, “to be” is a much more likely word sequence than “to is.”</p>
<p>One application of Markov models is in analyzing actual text and assessing the likelihood that a particular person uttered it. That is one objective of this assignment.</p>
<p>The other objective for this assignment is to create a hash tables module. Hash tables are data structures that store associations between keys and values (exactly like dictionaries in Python) and provide an efficient means of looking up the value associated with a given key. Hash tables find a desired entry rapidly by limiting the set of places where it can be. They avoid “hot spots,” even with data that might otherwise all seem to belong in the same place, by dispersing the data through hashing.</p>
<p>Apart from developing an appreciation of how hashing and hash tables work, you will also be better prepared if ever you need to write your own hash table in the future. For instance, you may use a language (like C) that does not feature a built-in hash table. Or, the hash table that is used by a language, like Python, may interact poorly with your particular data, obligating you to design a custom-tailored one. After completing this assignment, you should consider hash tables to be in your programming repertoire.</p>
<h2 id="part-1-hash-table-with-separate-chaining">Part 1: Hash Table with Separate Chaining </h2>
<p>In the interest of building and testing code one component at a time, you will start by building a hash table. Once you have gotten this module up and running, you will use it in your construction of a Markov model for speaker attribution.</p>
<p>There are different types of hash tables; for this assignment, we will use the type that is implemented with <strong>separate chaining</strong>.</p>
<p>Please look at <code>proj/hashtable.py</code>. You will modify this file to implement a hash table using the separate chaining algorithm. Your <code>Hashtable</code> class must satisfy the following requirements:</p>
<ul>
<li>Implement the required dunder methods in <code>Hashtable</code>.</li>
<li>You may assume all keys are strings.</li>
<li>Define an <code>__init__(self, capacity, default_value, load_factor, growth_factor)</code> constructor:
<ul>
<li><code>capacity</code> - initial number of cells to use. It must create a list of empty cells of the specified length. (You may assume this is always greater than zero.)</li>
<li><code>default_value</code> - value to return when looking up a key that has not been inserted.</li>
<li><code>load_factor</code> - A floating point number (0, 1]. If the fraction of occupied cells grows beyond the <code>load_factor</code> after an update, then you must perform a rehashing of the table. (See below.)</li>
<li><code>growth_factor</code> - An integer (>1) that represents how much to grow the table by when rehashing. For example, if <code>growth_factor = 2</code> then the size of the hash table would double on each rehashing.</li>
</ul>
</li>
<li>It must use a <strong>single</strong> Python <code>list</code> (the already-defined <code>self._items</code>) to hold the key and value in the table. Each non-empty cell of this list will hold a class of your creation (see "Separate Chaining" below).</li>
<li>We have provided a <code>self._hash</code> function that should be used; do not use the built-in <code>hash()</code>/<code>__hash__</code>.</li>
<li>You are free to implement additional properties, attributes, and methods as needed to implement this class.</li>
</ul>
<h3 id="separate-chaining">Separate Chaining </h3>
<p>Separate chaining is a method of collision resolution in hash tables. When a collision occurs, the new item is added to the end of a linked list. This allows for multiple items to be stored in the same cell. The <code>Hashtable</code> class will use separate chaining to resolve collisions. You will need to implement a linked list class to use for the purpose of storing your items. The design and testing of this class is up to you.</p>
<p>You may not use built-in types like <code>list</code> or <code>deque</code> for this, but should instead write your own implementation.</p>
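<p>As a concrete starting point, here is a minimal sketch of a node and a chain lookup. The names (<code>Node</code>, <code>chain_lookup</code>) are illustrative choices, not part of the required interface:</p>
<pre data-role="codeBlock" data-info="python" class="language-python python"><code># Minimal sketch of a singly linked list for separate chaining.
# Names (Node, chain_lookup) are illustrative, not required by the spec.
class Node:
    def __init__(self, key, value, next_node=None):
        self.key = key
        self.value = value
        self.next_node = next_node


def chain_lookup(head, key, default):
    """Walk a chain looking for key; return default if it is absent."""
    node = head
    while node is not None:
        if node.key == key:
            return node.value
        node = node.next_node
    return default
</code></pre>
<p>Your real class will also need insertion and update logic, plus whatever traversal support your <code>Hashtable</code> methods require.</p>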
<p>A diagram of a hash table with separate chaining is shown below:</p>
<p><img src="img/hashtable.png" alt=""></p>
<h3 id="rehashing">Rehashing </h3>
<p>A hash table built with chaining theoretically has no upper bound on the number of items it can store. That said, as the number of items grows, the number of collisions will grow as well. This will slow down the performance of the hash table. To avoid this, we will rehash the table when it becomes too full. We will use the load factor to determine when to rehash.</p>
<p>The cell capacity passed into the <code>__init__</code> will represent the initial size of the table.</p>
<p>If the fraction of occupied cells grows beyond <code>load_factor</code> after an update, then we will perform an operation called rehashing:</p>
<ul>
<li>Expand the size of the table by <code>growth_factor</code>. (For example, if <code>growth_factor = 2</code> then the size of the hash table would double on each rehashing.)</li>
<li>Migrate all the data into their proper locations in the newly-expanded hash table (i.e., each key is hashed again, with the hash function now considering the new size of the table).</li>
</ul>
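<p>The two rehashing steps above can be sketched as follows. This is a hedged sketch only: it uses a toy hash function (a stand-in for the provided <code>self._hash</code>) and plain per-cell lists of tuples in place of your chain class, and the names (<code>toy_hash</code>, <code>rehash</code>) are illustrative:</p>
<pre data-role="codeBlock" data-info="python" class="language-python python"><code>def toy_hash(key, capacity):
    """Stand-in for the provided self._hash; do not use this in your solution."""
    return sum(ord(c) for c in key) % capacity


def rehash(items, capacity, growth_factor):
    """Grow the table by growth_factor and re-place every stored pair."""
    new_capacity = capacity * growth_factor
    new_items = [[] for _ in range(new_capacity)]
    for cell in items:
        for key, value in cell:
            # each key is hashed again against the new, larger capacity
            new_items[toy_hash(key, new_capacity)].append((key, value))
    return new_items, new_capacity
</code></pre>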
<h2 id="part-2-a-speaker-recognition-system">Part 2: A Speaker Recognition System </h2>
<p><img src="img/obama-mccain.png" alt=""></p>
<p>Markov models are used extensively in speech recognition systems, and heavily in the domains of natural language processing, machine learning, and AI.</p>
<p>A Markov model defines a probabilistic mechanism for randomly generating sequences over some alphabet of symbols. A <code>k</code>-th order Markov model tracks the last <code>k</code> letters as the context for the present letter. We will build a module called <code>markov</code> that will work for any positive value of <code>k</code> provided. This module will reside in <code>proj/markov.py</code>.</p>
<p>While we use the term "letter", we will actually work with all characters (letters, digits, spaces, punctuation). Our code will distinguish between upper- and lower-case letters.</p>
<h3 id="building-the-markov-model-learning-algorithm">Building the Markov Model (Learning Algorithm) </h3>
<p>Given a string of text from an unidentified speaker, we will use a Markov model for a known speaker to assess the likelihood that the text was uttered by that speaker. Thus, the first step is building a Markov Model on known text from a speaker. For this assignment, the Markov Model will be represented as a <code>Hashtable</code>.</p>
<p><strong>You must use your implementation and not the built-in dictionaries from Python!</strong></p>
<p>You will be given an integer value of <code>k</code> at the start of the program. The keys of the table are strings of length <code>k</code> and <code>k + 1</code>, and each value is the number of times that key appeared in the text as a substring.</p>
<p>For example, let’s say you have a file called speakerA.txt that contains the following text:</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>This_is_.
</code></pre><p><strong>Note: You must consider all characters in the text.</strong> Punctuation, special characters, whitespace characters, etc., are all considered valid. Even if the text does not make any sense, you will still generate a Markov Model.</p>
<p>We will use this text <code>"This_is_."</code> to create a Markov model for Speaker A. The algorithm is as follows:</p>
<p>Starting from the beginning of text from some known speaker:</p>
<ol>
<li>For each character in the known text, you generate a string of length <code>k</code> that includes the current character plus the <code>k − 1</code> succeeding characters. (<em>Note:</em> The model is conventionally built from the <code>k</code> preceding letters, but our way works too because we are using a wrap-around effect.)</li>
<li>For each character in the known text, you generate a string of length <code>k+1</code> that includes the current character plus <code>k</code> succeeding characters. (e.g. for the first character 'T', you'd generate 'Th' and 'Thi' if k=2)</li>
<li>Certain characters will not have <code>k</code> or <code>k + 1</code> succeeding characters. For example, if <code>k = 2</code> for our example text, what are the succeeding characters for the character <code>'.'</code>? We will <strong>wrap around</strong>: we will think of the string circularly, and glue the beginning of the string onto the end to provide a source of the needed context. (e.g., for the example string, the strings generated for the last character '.' would be '.T' and '.Th'.)</li>
</ol>
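<p>The steps above can be sketched as a pair of small helpers; <code>circular_substrings</code> and <code>count_substrings</code> are illustrative names, not required by the assignment:</p>
<pre data-role="codeBlock" data-info="python" class="language-python python"><code>def circular_substrings(text, length):
    """Yield every substring of the given length, wrapping around the end."""
    doubled = text + text
    for i in range(len(text)):
        yield doubled[i:i + length]


def count_substrings(text, k):
    """Count every k- and (k+1)-length circular substring of text."""
    counts = {}
    for length in (k, k + 1):
        for s in circular_substrings(text, length):
            counts[s] = counts.get(s, 0) + 1
    return counts
</code></pre>
<p>For <code>"This_is_."</code> with <code>k = 2</code>, this yields counts such as <code>counts["is"] == 2</code> and <code>counts[".Th"] == 1</code>. In your solution the counts would go into your <code>Hashtable</code> rather than a <code>dict</code>.</p>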
<p>Below is a diagram of all the k and k+1 length strings that are generated from the speakerA.txt file given k=2:</p>
<p><img src="img/markov1.png" alt=""></p>
<p>The Markov Model (i.e., your <code>Hashtable</code>) will contain the number of times those <code>k</code>- and <code>k + 1</code>-length strings were generated from the known text. Thus, for the speakerA.txt file, the Markov Model generated will be the following:</p>
<p><img src="img/markov2.png" alt=""></p>
<p>Most of the <code>k</code> and <code>k+1</code> strings were only generated once, but some such as <code>"is"</code> occur more than once.</p>
<h3 id="determining-the-likelihood-of-unidentified-text-testing-algorithm">Determining the likelihood of unidentified text (Testing algorithm) </h3>
<p>As we stated earlier, given a string of text from an unidentified speaker, we will use the Markov model for a known speaker to assess the likelihood that the text was uttered by that speaker. Likelihood, in this context, is the probability of the model generating the unknown sequence.</p>
<p><strong>If we have built models for different speakers, then we will have likelihood values for each, and will choose the speaker with the highest likelihood as the probable source.</strong></p>
<p>These probabilities can be very small, since they take into account every possible phrase of a certain length that a speaker could have uttered. Therefore, we expect that all likelihoods are low in an absolute sense, but will still find their relative comparisons to be meaningful. Very small numbers are problematic, however, because they tax the precision available for floating-point values. The solution we will adopt for this problem is to use log probabilities instead of the probabilities themselves. This way, even a very small number is represented by a negative value in the range between zero and, for instance, -20. If our <strong>log probabilities</strong> are negative and we want to determine which probability is more likely, will the greater number (the one closer to zero) represent the higher or lower likelihood, based on how logarithms work?</p>
<p>Note that we will use the <code>math.log</code> function, which calculates natural logarithms. Your code should use this base for its logarithms. While any base would suffice to ameliorate our real number precision problem, we will be comparing your results to the results from our implementation, which itself uses natural logs.</p>
<p>The process of determining likelihood given a model is similar to the initial steps of building the Markov Model in the previous section. Starting from the beginning of text for some unknown speaker:</p>
<ol>
<li>
<p>For each character in the unknown text, you generate a string of length k that includes the current character plus k − 1 succeeding characters.</p>
</li>
<li>
<p>For each character in the unknown text, you generate a string of length k + 1 that includes the current character plus k succeeding characters.</p>
</li>
<li>
<p>For certain characters, they will not have k or k + 1 succeeding characters. You will use the same wrap around mechanism as described previously. (<strong>Note:</strong> These steps are identical to the steps in the learning process, think about what that suggests for your design!)</p>
</li>
<li>
<p>We then need to compute the probability that the string of length <code>k+1</code> follows a string of length <code>k</code>. To do this, look up in our data table the number of times that we have observed the string of length <code>k+1</code> (we will call this <code>M</code>), and the number of times we have observed the string of length <code>k</code> (we will call this <code>N</code>). The log probability would be <code>log(M/N)</code>.</p>
</li>
<li>
<p>We need to keep in mind that we are constructing the model with one set of text and using it to evaluate other text. The specific letter sequences that appear in that new text are not necessarily guaranteed ever to have appeared in the original text. Consequently, we are at risk of dividing by zero.</p>
</li>
</ol>
<p>It turns out that there is a theoretically justifiable solution to this issue, called Laplace smoothing. We modify the simple equation above by adding to the denominator the number of unique characters that appeared in the original text we used for modeling. For instance, if every letter in the alphabet appeared in the text, we add 26. (In practice, the number is likely to be greater, because we distinguish between upper- and lower-case letters, and consider spaces, digits, and punctuation as well.) Because we have added this constant to the denominator, it will never be zero. Next, we must compensate for the fact that we have modified the denominator; a theoretically sound way to balance this is to add one to the numerator. Symbolically, if N is the number of times we have observed the k succeeding letters and M is the number of times we have observed those letters followed by the present letter, and S is the size of the “alphabet” of possible characters, then our probability is</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>log((M + 1)/(N + S))
</code></pre><p>For example, let’s say you have a file called speakerC.txt (i.e., the unidentified text):</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>This
</code></pre><p>Calculating the total likelihood requires summing all of the individual likelihoods. It will be done in the following way using the model we built from speakerA.txt:</p>
<p><img src="img/markov3.png" alt=""></p>
<p><strong>Note:</strong></p>
<ul>
<li>When we say "Model(K)" we are looking at the hash table inside the model and retrieving the counts for that string.</li>
<li>S is the number of unique characters that we encountered while building the model. This means that, while building the model, you need to maintain a set of all unique characters.</li>
</ul>
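<p>A single smoothed term can be computed directly from the <code>log((M + 1)/(N + S))</code> formula above; the function name here is an illustrative choice:</p>
<pre data-role="codeBlock" data-info="python" class="language-python python"><code>import math


def smoothed_log_term(m, n, s):
    """One Laplace-smoothed log-probability term: log((M + 1)/(N + S)).

    m: count of the (k+1)-length string in the model,
    n: count of the k-length string in the model,
    s: number of unique characters seen while building the model.
    """
    return math.log((m + 1) / (n + s))
</code></pre>
<p>Note that when a string was never observed (m = 0 and n = 0), the term is still well defined because s is positive.</p>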
<h3 id="markov-class">Markov Class </h3>
<p>Inside <code>proj/markov.py</code>, you will implement a <code>Markov</code> class with the following methods</p>
<ol>
<li>An <code>__init__(self, k, text, use_hashtable)</code> method.</li>
</ol>
<ul>
<li><code>k</code> will be the order <code>k</code> for your Markov model.</li>
<li><code>text</code> will be the text used to build the model.</li>
<li>If <code>use_hashtable</code> is true, you will use your <code>Hashtable</code> class internally. We have provided the constant <code>HASH_CELLS</code> as a good starting point for your hashtables, though they will need to grow significantly to accommodate the sample text. You can instantiate your hash table as <code>Hashtable(HASH_CELLS, 0, 0.5, 2)</code></li>
<li>If <code>use_hashtable</code> is false, you will use the built in <code>dict</code> data structure to represent the internal Markov model.<br>
This will allow us to run performance tests on whether your implementation of the hash table is faster or slower than using the built in <code>dict</code> type for this problem.</li>
</ul>
<ol start="2">
<li><code>log_probability(self, s)</code> takes a string and returns the log probability that the modeled speaker uttered it, using the approach described above.</li>
</ol>
<h3 id="identify_speaker"><code>identify_speaker</code> </h3>
<p>You will also need to implement the function <code>identify_speaker(speech1, speech2, speech3, k, use_hashtable)</code>.</p>
<p>This function will be called by the <code>main</code> function with three strings (<code>speech1</code>, <code>speech2</code>, <code>speech3</code>) and a value of <code>k</code> and <code>use_hashtable</code> to instantiate your Markov model.</p>
<p>The function should learn models for <code>speech1</code> and <code>speech2</code>, calculate the <strong>normalized log probabilities</strong> that those two speakers uttered <code>speech3</code> and determine which speaker is more likely.</p>
<p>The return value of the function should be a 3-element tuple with the two calculated probabilities and a string "A" or "B" indicating which speaker most likely gave <code>speech3</code>.</p>
<p>While <code>log_probability</code> yields the likelihood that a particular speaker uttered a given string, the raw value depends on the length of the string: there are many more possible strings of a greater length, and probabilities are calculated across the universe of all possible strings of the same length, so values vary significantly with the length of the phrase. To reduce this effect, divide each log probability by the length of the string to yield a normalized value. To be clear, this division does not occur in <code>log_probability</code>, but rather in <code>identify_speaker</code>. Also note that within a single run we calculate log probabilities, under different models, for the same string, so length differences are not a factor there. Rather, we normalize so that outputs from different runs, which may analyze quotes of different lengths, are more directly comparable.</p>
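<p>Putting the pieces together, <code>identify_speaker</code> might look like the sketch below. The inline <code>Markov</code> here is a minimal dict-mode stand-in for your real class; note that the division by <code>len(speech3)</code> happens in this function:</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>import math

class Markov:  # minimal dict-mode stand-in for your real class
    def __init__(self, k, text, use_hashtable=False):
        self.k, self.s, self.counts = k, len(set(text)), {}
        for n in (k, k + 1):
            for i in range(len(text) - n + 1):
                sub = text[i:i + n]
                self.counts[sub] = self.counts.get(sub, 0) + 1

    def log_probability(self, string):
        total = 0.0
        for i in range(len(string) - self.k):
            num = self.counts.get(string[i:i + self.k + 1], 0) + 1
            den = self.counts.get(string[i:i + self.k], 0) + self.s
            total += math.log(num / den)
        return total

def identify_speaker(speech1, speech2, speech3, k, use_hashtable):
    model_a = Markov(k, speech1, use_hashtable)
    model_b = Markov(k, speech2, use_hashtable)
    # Normalize here, not inside log_probability.
    prob_a = model_a.log_probability(speech3) / len(speech3)
    prob_b = model_b.log_probability(speech3) / len(speech3)
    likely = "A" if prob_a > prob_b else "B"
    return (prob_a, prob_b, likely)
</code></pre>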
<p><strong>Note:</strong> You can test your implementation of this file by running the tests inside <code>proj/test_markov.py</code>. You must pass these tests in less than 60 seconds to avoid a penalty.</p>
<p><strong>Note:</strong> There are additional test cases in <code>proj/test_markov_long.py</code>; these are mostly redundant with the tests in <code>test_markov</code> and can take a long time to run.</p>
<h2 id="part-3-driver-files--performance-testing">Part 3: Driver Files & Performance Testing </h2>
<p>You will need to implement a driver for the speaker recognition system inside <code>proj/driver.py</code>.<br>
We have provided a set of files containing text from United States presidential debates from the 2004 and<br>
2008 general elections. In the 2004 election, George W. Bush debated John Kerry; in the 2008 debates,<br>
Barack Obama went up against John McCain. We have provided single files for Bush, Kerry, Obama, and<br>
McCain to use to build models. These files contain all the text uttered by the corresponding candidates from two debates. We have also provided directories from the third debates of each election year, containing many files, appropriately labeled, that have remarks made by one of the candidates.</p>
<p>Inside <code>proj/driver.py</code> the <code>if __name__ == "__main__"</code> block will take several command line arguments:</p>
<ul>
<li>The first argument is the name of a text file with speeches from speaker A.</li>
<li>The second argument is the name of a text file with speeches from speaker B.</li>
<li>The third argument is the name of a text file with text to classify.</li>
<li>The fourth argument is the order <code>k</code> to use for the Markov models.</li>
<li>The fifth argument is either the string "hashtable" or "dict", indicating which implementation to use.</li>
</ul>
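<p>Reading these arguments with <code>sys.argv</code> might look like the sketch below (function and variable names are hypothetical):</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>import sys

def parse_args(argv):
    # argv layout: [program, file_a, file_b, file_c, k, implementation]
    if len(argv) != 6:
        print("usage: python proj/driver.py file_a file_b file_c k hashtable_or_dict")
        sys.exit(1)
    use_hashtable = argv[5] == "hashtable"
    return argv[1], argv[2], argv[3], int(argv[4]), use_hashtable
</code></pre>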
<p>Your application should read in text from the text files and call your <code>identify_speaker</code> function to retrieve log probabilities and the most likely speaker. It should then print output like the following:</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>$ python proj/driver.py speeches/bush1+2.txt speeches/kerry1+2.txt speeches/bush-kerry3/BUSH-0.txt 2 hashtable
Speaker A: -2.1670591295191572
Speaker B: -2.2363636778055525
Conclusion: Speaker A is most likely
</code></pre><h3 id="performance-testing">Performance Testing </h3>
<p>Finally, you will write another driver file <code>proj/performance.py</code> that will run performance tests on your <code>Markov</code> class with your <code>Hashtable</code> and <code>dict</code> modes.</p>
<p>This file will also take five command line arguments:</p>
<ul>
<li>The first argument is the name of a text file with speeches from speaker A.</li>
<li>The second argument is the name of a text file with speeches from speaker B.</li>
<li>The third argument is the name of a text file with text to classify.</li>
<li>The fourth argument is the maximum <code>k</code> to use for the Markov models.</li>
<li>The fifth argument is the number of runs to measure performance.</li>
</ul>
<p>You will time your Markov model in both modes to determine which one is faster, recording the timings in a single pandas dataframe. Specifically, for each <code>k</code> from 1 up to the maximum <code>k</code> (inclusive), you will time how long <code>identify_speaker</code> takes to return, repeating the measurement for the given number of runs. For example, assume we run the program as follows,</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>$ python proj/performance.py speeches/bush1+2.txt speeches/kerry1+2.txt speeches/bush-kerry3/BUSH-0.txt 2 3
</code></pre><p>This indicates three runs each for <code>k=1</code> and <code>k=2</code>; your pandas dataframe might look something like:</p>
<table>
<thead>
<tr>
<th>Implementation</th>
<th>K</th>
<th>Run</th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr>
<td>hashtable</td>
<td>1</td>
<td>1</td>
<td>0.345</td>
</tr>
<tr>
<td>hashtable</td>
<td>1</td>
<td>2</td>
<td>0.302</td>
</tr>
<tr>
<td>hashtable</td>
<td>1</td>
<td>3</td>
<td>0.386</td>
</tr>
<tr>
<td>hashtable</td>
<td>2</td>
<td>1</td>
<td>0.585</td>
</tr>
<tr>
<td>hashtable</td>
<td>2</td>
<td>2</td>
<td>0.598</td>
</tr>
<tr>
<td>hashtable</td>
<td>2</td>
<td>3</td>
<td>0.611</td>
</tr>
<tr>
<td>dict</td>
<td>1</td>
<td>1</td>
<td>0.045</td>
</tr>
<tr>
<td>dict</td>
<td>1</td>
<td>2</td>
<td>0.130</td>
</tr>
<tr>
<td>dict</td>
<td>1</td>
<td>3</td>
<td>0.075</td>
</tr>
<tr>
<td>dict</td>
<td>2</td>
<td>1</td>
<td>0.145</td>
</tr>
<tr>
<td>dict</td>
<td>2</td>
<td>2</td>
<td>0.166</td>
</tr>
<tr>
<td>dict</td>
<td>2</td>
<td>3</td>
<td>0.202</td>
</tr>
</tbody>
</table>
<p>We run multiple times because a single run can be affected by unrelated issues elsewhere on the system.</p>
<p>You will use these timings to make a <code>seaborn</code> graph showing how run time varies with the implementation and increasing <code>k</code> value. For each implementation and <code>k</code>, you will then take the average of the timings across the runs.</p>
<p><strong>Note:</strong> The structure above might not be the way you choose to structure your dataframe. <strong>You can only use one pandas dataframe.</strong></p>
<p>Given that you will be graphing the data inside the dataframe, I would highly recommend you think about how to structure the code based on how <code>seaborn</code> will need the data.</p>
<p><strong>Pandas Tip:</strong> Keep <code>df.groupby</code> and <code>df.loc</code> in mind.</p>
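<p>For instance, collecting one row per (implementation, k, run) and averaging with <code>groupby</code> might look like the sketch below (the column names are one possible choice, not a requirement):</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>import pandas as pd

rows = []
for implementation in ("hashtable", "dict"):
    for k in (1, 2):
        for run in (1, 2, 3):
            elapsed = 0.1 * k  # placeholder; use your real timing here
            rows.append({"Implementation": implementation, "K": k,
                         "Run": run, "Time": elapsed})
df = pd.DataFrame(rows)  # the single dataframe
averages = df.groupby(["Implementation", "K"], as_index=False)["Time"].mean()
</code></pre>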
<p>You can accurately time your code using a snippet like the below:</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>import time
start = time.perf_counter()
tup = identify_speaker(speech1, speech2, speech3, k, use_hashtable=True)
elapsed = time.perf_counter() - start
</code></pre><h3 id="graphing">Graphing </h3>
<p>After producing your <code>pandas</code> dataframe, you will produce a <code>seaborn</code> graph where the x axis will be the <code>k</code> values and the y axis will be the <strong>average time of the runs</strong> for each k value per implementation type. Here's an example graph:</p>
<p><img src="img/graph.png" alt=""></p>
<p>Your graph should be similar, with axes labeled and appropriately scaled so we can see the values. (That is, if your values are in the range [0, 1], do not make the scale of your graph [0, 14].)</p>
<p>You can use a point plot with the following arguments:</p>
<pre data-role="codeBlock" data-info="" class="language-text text"><code>sns.pointplot(..., linestyle='-', marker='o')
</code></pre><p>Your code should save this graph as <code>execution_graph.png</code>.</p>
<h2 id="debugging-suggestions">Debugging Suggestions </h2>
<ul>
<li>Make sure you have chosen the right data structure for the Markov model based on the <code>use_hashtable</code> parameter!</li>
<li>Check your collision- and deletion-handling code carefully; it is a common source of errors. While we provide basic tests to ensure the external interface is working correctly, it is highly advisable to write additional tests specific to your implementation of the hash table.</li>
<li>Test the code for constructing a Markov model using a very small string, such as "abcabd". Check your data structure to make sure that you have the right set of keys and the correct counts.</li>
<li>Make sure you are using the correct value for S, the number of unique characters that appeared in the training text.</li>
</ul>
<h2 id="acknowledgment">Acknowledgment </h2>
<p>This assignment’s text and documentation originated from CAPP 30122 Team @ The University of Chicago. The original development of the assignment was done by Rob Schapire with contributions from Kevin Wayne.</p>
</div>
</body></html>