Difference between revisions of "Solipsist Development"

From Robert-Depot
(Week 4)
(How To Compile on OS X)
 
(165 intermediate revisions by the same user not shown)
Line 1: Line 1:
 +
[[Home | <<< back to Wiki Home]]
 +
 +
==pocket sphinx by hand==
my pocketsphinx:
<code>
~/supercollider/solipsist/binaries/pocketsphinx-osx-continuous \
-hmm ~/supercollider/solipsist/data/models/hmm/en_US/hub4wsj_sc_8k \
-dict ~/supercollider/solipsist/data/models/script/script.dic \
-lm ~/supercollider/solipsist/data/models/script/script.lm \
-infile robertrauschenberg1_rs.wav
</code>
default pocketsphinx:
<code>
pocketsphinx_continuous -hmm ~/supercollider/solipsist/data/models/hmm/en_US/hub4wsj_sc_8k \
-dict ~/supercollider/solipsist/data/models/script/script.dic \
-lm ~/supercollider/solipsist/data/models/script/script.lm \
-infile robertrauschenberg1_rs.wav
</code>
 +
 
==Proposal - Solipsist ==
 
 
I have been working with voice recognition technologies and Mel Bochner's text 'Serial Art, Systems, Solipsism', developing a device for performance and exchange between human and computer.  The device consists of a microphone, speech recognition system, software, and a receipt printer.  The format of the conversation is a dialog between human voice and printed receipts in text--the system transcribes (and validates) what it hears in terms of the words it knows.  As is characteristic of voice recognition, face recognition, and other kinds of machine perception that operate within explicitly defined or statistically trained spaces of "perception", this is a solipsistic system, "denying the existence of anything outside the confines of its own mind" (Bochner, 1967).  This character of the solipsist is one that Mel Bochner evoked to describe the autonomy and denial of external reference in minimalist sculpture of the 1960s, and one that I find particularly appropriate to describe current "smart" technologies--ultimately the agency of the system still comes down to whatever agency the programmers have embedded in it, a sort of ventriloquism.  This idea of a closed, narrowly parameterized space of perception in the machine is an interesting model for (and contrast to) issues of language, vocabulary, and free expression in humans--an exploration I intend to pursue through this project.
Line 7: Line 26:
  
 
Finally, I need to address sonic properties of the piece and its time course as a composition.  The voice of the viewer as they speak into the microphone is one sound source in the system, and the receipt printer has a very assertive (and retro) dot-matrix-ey  sound as it prints out text and cuts rolls of paper.  I need to make some decisions about how to use the sounds of the speaker and the printer over the course of the piece.  Also, will I add in additional sound sources such as more printers, pre-recorded voices, voices of past participants, or processed sounds?  There are additional possibilities here for rhythmic exchanges between the percussive sound of the printer and the speaker, for long pauses, silences, and repetitions.  Additionally, I need to establish some overall arc for the piece--does an encounter with the system travel through to one pre-ordained conclusion?  Are there multiple branching possibilities that change depending on what the viewer says and how they respond to the printouts?  Finally there is a relationship to be explored between speech as text and speech as sound--a parallel to the roles of printing as text and printing as sound.  The fundamental distinctions between text, sound, and speech as kinds of communication and expression can be ripe territory for exploration.  I suspect that these conceptual and compositional questions will occupy most of my time this quarter and comprise the bulk of the work that I need to do.
 
 +
 +
Where else do we do this sort of projection and anthropomorphism? (projecting psychology or attributing intention to non-intelligent systems)
  
 
The most obvious technical challenges I foresee at this point are the implementation of sound input and pre-processing with supercollider, and interfacing from supercollider to the speech recognition library and receipt printer.  Because part of this project is a critical investigation of the strengths and limitations of automatic speech recognition (ASR) technology, I intend to get more involved with the internal mechanisms of speech recognition as implemented in the Sphinx-4 library.  A more comprehensive understanding of that technology is necessary to figure out how to tweak it and expose its internal character and assumptions.
Line 24: Line 45:
 
**see [[#Code | Code]] section below.

'''<nowiki>*</nowiki>I do not have these items yet.'''
==Open Questions==
+
==Timeline==
*Where else do we do this sort of projection and anthropomorphism? (projecting psychology or attributing intention to non-intelligent systems)
+
'''Week 1 - 2'''
  
==Timeline==
 
===Week 1 - 2===
 
 
Introduction to course and project development.
 
===Week 3 - Proposal - 4/12===
+
 
 +
 
 +
'''Week 3 - Proposal - 4/12'''
 
#Write this.
#Meet with Juan.
#Find desk and desk lamp for "me vs. the computer" staging of microphone/speech recognition system.
  
===Week 4 - 4/19===
+
'''Week 4'''
 +
 
 
Work time.
 
  
===Week 5 - MILESTONE 1 - 4/26===
+
'''Week 5 - MILESTONE 1'''
Working model of each of two tracks:
+
 
 +
Working model of each of <s>two</s> one tracks:
 
#Participant speaking to computer.
#<s>Computer/printer speaking to itself (feedback loop).  Interpreting printer sounds as speech.  Or transforming them into speech.</s>
 +
 
 +
'''Week 6'''
  
===Week 6 - 5/3===
 
 
Realize that the best approach will combine elements of each of the two tracks above.
 
  
===Week 7 - MILESTONE 2 - 5/10===
+
'''Week 7 - MILESTONE 2'''
 +
 
 
*Experiments with the characterization of the system:
**Software agent?  agency.  towards what goals?
Line 57: Line 82:
 
I imagine these two go hand in hand--that the choice of particular texts will lend much of the character to the piece.
 
  
===Week 8 - 5/17===
+
'''Week 8'''
 +
 
 
Have others experience the system, try it out.
 
  
===Week 9 - MILESTONE 3 - 5/24===
+
'''Week 9 - MILESTONE 3'''
 +
 
 
Near-final form, near-final realization.

Viewer interaction tests.
  
===Week 10 - 5/31===
+
'''Week 10'''
 +
 
 
Final changes, improvements, last minute blitz.
 
===Presentation - 6/7===
+
 
 +
'''Presentation'''
  
 
==Progress==
 
Line 86: Line 115:
 
**in sphinx, SpeechClassifier
***"uses Bent Schmidt Nielsen's algorithm. Each time audio comes in, the average signal level and the background noise level are updated, using the signal level of the current audio. If the average signal level is greater than the background noise level by a certain threshold value (configurable), then the current audio is marked as speech. Otherwise, it is marked as non-speech." from http://cmusphinx.sourceforge.net/sphinx4/javadoc/edu/cmu/sphinx/frontend/endpoint/SpeechClassifier.html
***the threshold in the config.xml file I am using is "13".  (a unit-less thirteen).
***"Could anyone provide the link of the paper about that algorithm?" "I don't think this algorithm is worth a paper. It's very simple" http://sourceforge.net/projects/cmusphinx/forums/forum/382337/topic/4364498
**Bent K. Schmidt-Nielsen http://www.merl.com/people/?user=bent
 
*Another aspect that interests me--as far as me as performer, hearing my own voice--is "I am Sitting In A Room" by Alvin Lucier [http://ubumexico.centro.org.mx/sound/source/Lucier-Alvin_Sitting.mp3].  Listening to it again just now, I noticed for the first time that he has a stutter.  This adds an interesting psychological dimension to the performance, where he states: "...so that any semblance of my speech, with perhaps the exception of rhythm, is destroyed.  What you will hear, then, are the natural resonant frequencies of the room articulated by speech.  I regard this activity not so much as a demonstration of a physical fact, but more as a way to smooth out any irregularities my speech might have."
*that second part, "What you will hear then are the natural resonant frequencies of the room", is analogous for me to sounding out the linguistic possibilities of the speech recognition system.  I should explore the limits of the Wall Street Journal (WSJ_5k) model and the larger corpora of language/acoustic information.  It is about *articulating* the limits of a system with your own voice.
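
A rough Python sketch of the energy-threshold idea quoted from the SpeechClassifier documentation above. This is not the Sphinx-4 source: the smoothing and adjustment constants are made up for illustration, and treating the "13" from my config.xml as a dB-like margin is an assumption.

```python
import math


def frame_level_db(samples):
    """Return the log-RMS level of one frame of PCM samples, in dB."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    # Clamp so that silence maps to 0 dB instead of log(0).
    return 20.0 * math.log10(max(rms, 1.0))


class EnergySpeechClassifier:
    """Marks a frame as speech when the smoothed level exceeds the noise floor."""

    def __init__(self, threshold=13.0, adjustment=0.003, smoothing=0.5):
        self.threshold = threshold    # margin over background (assumed dB-like)
        self.adjustment = adjustment  # how fast the noise floor rises (hypothetical)
        self.smoothing = smoothing    # weight of the newest frame (hypothetical)
        self.background = None
        self.average = None

    def is_speech(self, samples):
        level = frame_level_db(samples)
        if self.background is None:
            # The first frame initialises both running estimates.
            self.background = level
            self.average = level
        # Running average of the signal level.
        self.average += (level - self.average) * self.smoothing
        # The noise floor drops at once to quieter frames and rises slowly otherwise.
        if level < self.background:
            self.background = level
        else:
            self.background += (level - self.background) * self.adjustment
        return (self.average - self.background) > self.threshold
```

Feeding it one frame at a time (for example 10 ms of samples) gives a speech/non-speech flag per frame; a segmenter would then group consecutive speech frames into utterances.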
 +
 
 +
=== Week 5 - Milestone 1 ===
 +
Done:
 +
*working model
 +
*bidirectional OSC communication between processing/supercollider
 +
*some planning
 +
 
 +
Pictures (on my computer)
 +
 
 +
To Do - Conceptual:
 +
*end state? (criteria for cut)
 +
*changing identities?
 +
*learn from participants? (if so, what model/vocabulary am I matching against, e.g. WSJ4K, TI_DIGITS, custom vocab?)
 +
*bored behavior? (when no-one is in the room)
 +
 
 +
To Do - Technical:
 +
*folding camp/travel table.
 +
*speech classifier in supercollider
 +
*save out audio files of speech sound
 +
*responder.  state modeling? what language?
 +
*sound isolation possibilities.  multi-pane commercial glazing would be my preference, though it is expensive I'm sure.
 +
 
 +
=== Week 6 ===
 +
Pocketsphinx:
 +
* Pocketsphinx — recognizer library written in C
 +
* Tutorial http://cmusphinx.sourceforge.net/wiki/tuturialpocketsphinx
 +
* API Documentation http://cmusphinx.sourceforge.net/api/pocketsphinx/
 +
Language Models:
 +
*Build a language model with cmulmtk http://cmusphinx.sourceforge.net/wiki/tutoriallm
 +
*Online language model toolkit http://www.speech.cs.cmu.edu/tools/lmtool-new.html
 +
Sphinx Website Stuff:
 +
 
 +
*segmentation and diarization (identifying distinct speakers) in long form audio: http://cmusphinx.sourceforge.net/wiki/speakerdiarization
 +
Articles:
 +
*Robots with Bad Accents- Living with Synthetic Speech 2008. [[File:Robots with Bad Accents- Living with Synthetic Speech 2008.pdf]]
 +
 
 +
=== Week 7 - Milestone 2 ===
 +
*Compiled pocket sphinx as static, 32/64bit library.
 +
**The x-code demo project is here: http://svn.roberttwomey.com/supercollider/pocketsphinx-osx/
 +
**which I pretty much copied from here: http://cmusphinx.sourceforge.net/wiki/tuturialpocketsphinx
 +
In Class Discussion:
 +
<pre>
put a coin to purchase the receipt
occupying roles
master script: stelios
learning from its own output: nico
that's what I said: juan
more than just the filter: juan
some analysis of the text
reconfigure the text
what's the thing about the receipt printer?: nico
coin-op justifies the receipt.  commercial transaction
record of a transaction
oracle, coin-op oracle: nico
trimpin, coin-op sculpture: juan
automatic poetry generator
ask people to engage poetically with it, record them
the idea of the residual, the parts/things that people say that are outside of the system (its closed language-field)
site-specificity: administrative  scout out sites with speech rec system.
not be so one-dimensional, analyze tonality, intonation, rate.  profiling affect.
</pre>
 +
 
 +
=== Week 8 ===
 +
Josh P. suggests Secret Rabbit Code (libsamplerate) to do my supercollider -> 16 kHz (sphinx) downsampling.  The following three libraries can be built and installed:
 +
#pkg-config 0.23 builds on OS X: http://mac.softpedia.com/progDownload/pkg-config-Download-46896.html
 +
#libsndfile: http://www.mega-nerd.com/libsndfile/#Download
 +
#libsamplerate: http://www.mega-nerd.com/SRC/download.html
 +
Convert an audio file to 16-bit little-endian PCM, mono, 16 kHz (the output is still a WAV file, not headerless raw):
 +
*<code> ffmpeg -i solipsism1.wav -acodec pcm_s16le  -ac 1 -ar 16000 solipsism1_pcm_s16le.wav</code>
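
A quick way to confirm that a converted file really has the format requested in the ffmpeg command above (mono, 16-bit, 16 kHz) is to read its header with Python's standard <code>wave</code> module. This is only a sanity-check sketch; the function name is made up.

```python
import wave


def check_sphinx_wav(path, rate=16000):
    """Return a list of problems; an empty list means the file looks usable."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != 1:
            problems.append("expected mono, got %d channels" % wav.getnchannels())
        if wav.getsampwidth() != 2:
            problems.append("expected 16-bit samples, got %d-bit"
                            % (8 * wav.getsampwidth()))
        if wav.getframerate() != rate:
            problems.append("expected %d Hz, got %d Hz"
                            % (rate, wav.getframerate()))
    return problems
```

For example, <code>check_sphinx_wav("solipsism1_pcm_s16le.wav")</code> should return an empty list if the conversion worked as intended.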
 +
 
 +
*cmu07a.dic
 +
Later in the week:
 +
*Convert from .lm (produced by online language tool) to DMP:  http://cmusphinx.sourceforge.net/wiki/tutoriallm#converting_model_into_dmp_format
 +
*sample audio file: [[File:Solipsism sphinx pcm s16le.wav]]
 +
*Speech segmenter tool in supercollider http://svn.roberttwomey.com/supercollider/supercollider/speechtool.scd
 +
 
 +
=== Week 9 - Milestone 3 ===
 +
To Do:
 +
*finish connecting up components (e.g. SC to pocketsphinx and back)
 +
*enable self-modifying grammar/Language Model
 +
 
 +
*Updated pocketsphinx-osx code to accept command line arguments: 
 +
**run recognition with a grammar: <code> ./pocketsphinx-osx-continuous -jsgf solipsist.gram -infile solipsism1_pcm_s16le.wav </code>
 +
**run recognition with a custom language model: <code> ./pocketsphinx-osx-continuous -dict 9316.dic -lm 9316.lm </code>
 +
 
 +
=== Week 10 ===
 +
 
 +
All right!  I made a lot of progress on my speechtool supercollider code (http://svn.roberttwomey.com/supercollider/supercollider/speechtool.scd).
 +
 
 +
New features:
 +
*It is relatively sensitive to sound levels and does a decent job of detecting possible speech above background noise.
 +
*It correctly saves out raw time-stamped audio files to '''~/Sounds/SpeechFiles/''' of all sounds detected and sent to pocket-sphinx
 +
*It calls <code>sndfile-resample</code> to resample 44.1 kHz audio files to 16 kHz for pocket-sphinx.
 +
*It runs recognition with pocketsphinx and prints results to the post window in the format <code>Recognized: .... </code>
 +
*It saves a text logfile in '''~/Sounds/SpeechFiles/logs''' with the DateStamped .wav file name and the recognized string.
 +
*It now works with a JSGF grammar rather than a statistical language model.  Example command line options:
 +
**<code>./pocketsphinx-osx-continuous -jsgf solipsist.gram -infile numbers.raw</code>
 +
**with this grammar file: http://svn.roberttwomey.com/supercollider/pocketsphinx-osx/build/Debug/solipsist.gram
 +
*With an SLM from online tool:
 +
**<code>./pocketsphinx-osx-continuous -dict /Users/rtwomey/Documents/dxarts463_sp11/mmpi-2/9316.dic -lm /Users/rtwomey/Documents/dxarts463_sp11/mmpi-2/9316.lm</code>
 +
*With an SLM from local tools:
 +
**<code>./pocketsphinx-osx-continuous -lm /Users/rtwomey/Documents/dxarts463_sp11/mmpi2_slm/mmpi2.arpa</code>
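
The resample-then-recognize steps in the feature list above can be sketched outside supercollider as two command lines. This Python sketch only builds the argument lists and does not run anything. It assumes <code>sndfile-resample</code> accepts <code>-to &lt;rate&gt; &lt;input&gt; &lt;output&gt;</code>, and the <code>_16k</code> output naming is made up; the recognizer flags are the ones used in the examples above.

```python
import os


def build_commands(wav_path, grammar="solipsist.gram",
                   recognizer="./pocketsphinx-osx-continuous", rate=16000):
    """Return (resample_argv, recognize_argv) for one captured utterance."""
    base, ext = os.path.splitext(wav_path)
    # Hypothetical naming scheme for the resampled copy.
    resampled = "%s_%dk%s" % (base, rate // 1000, ext)
    # Assumption: sndfile-resample takes "-to <rate> <input> <output>".
    resample = ["sndfile-resample", "-to", str(rate), wav_path, resampled]
    recognize = [recognizer, "-jsgf", grammar, "-infile", resampled]
    return resample, recognize


def parse_recognized(post_text):
    """Pull results out of post-window lines of the form 'Recognized: ...'."""
    prefix = "Recognized:"
    return [line[len(prefix):].strip()
            for line in post_text.splitlines() if line.startswith(prefix)]
```

Each list could be handed to <code>subprocess.run</code> in turn; keeping the commands as lists avoids shell-quoting problems with paths that contain spaces.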
 +
 
 +
Info on default models:
 +
*HMM (acoustic?) hub4wsj_sc_8k (hmm/en_US/hub4wsj_sc_8k)
 +
*Dictionary: cmu07a.dic, 133436 words (lm/en_US/cmu07a.dic)
 +
*Filler dictionary: noisedict (hmm/en_US/hub4wsj_sc_8k/noisedict)
 +
*LM... either hub4.5000.DMP or wsj0vp.5000.DMP, I am guessing. (?)
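
To poke at the pronunciation dictionary listed above, a small parser is enough. This sketch assumes the usual CMU plain-text layout (one entry per line: the word, then its phones separated by whitespace, with alternate pronunciations written as <code>word(2)</code>, <code>word(3)</code>); I have not checked every line of cmu07a.dic against it, and the sample phones in the usage note are illustrative.

```python
import re


def parse_dict(text):
    """Map each word to a list of pronunciations (each a list of phones)."""
    entries = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        # Alternate pronunciations are written word(2), word(3), ...
        word = re.sub(r"\(\d+\)$", "", fields[0])
        entries.setdefault(word, []).append(fields[1:])
    return entries
```

Given the two lines <code>receipt R IH S IY T</code> and <code>receipt(2) R IY S IY T</code>, the result maps <code>receipt</code> to both phone sequences, which is a starting point for hunting homophones by comparing phone lists.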
 +
 
 +
To Do:
 +
*Field recordings with baseline vocabulary and language models.  (I think this is the WSJ5K)
 +
*Test with different LMs or Grammars
 +
*Make more decisions.
 +
 
 +
=== Finals Week ===
 +
 
 +
Rhyming + Homophonic sounds:
 +
*english homophones - http://www.all-about-spelling.com/list-of-homophones.html
 +
*approximate homophonic phrases (same input repeated multiple times to varied interpretation):
 +
** "receipt" -> three feet
 +
** "receipt" -> three feet
 +
**"receipt" -> the heat
 +
** "receipt" -> proceed
 +
 
 +
Building a Grammar:
 +
*python script makegrammar_for_cmusphinx.py - http://svn.roberttwomey.com/language/nltk2.0/makegrammar_for_cmusphinx.py, login: guest pw: user
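
For orientation, the smallest useful JSGF grammar is a single public rule listing alternative phrases. The sketch below is not the makegrammar_for_cmusphinx.py script linked above; it is a hypothetical helper that shows the file format only, and the rule name <code>&lt;utterance&gt;</code> is arbitrary.

```python
def make_jsgf(name, phrases):
    """Return a JSGF grammar that accepts any one of the given phrases."""
    alternatives = " | ".join(phrases)
    return ("#JSGF V1.0;\n"
            "grammar %s;\n"
            "public <utterance> = ( %s );\n" % (name, alternatives))
```

Writing the returned string to a <code>.gram</code> file gives something that can be passed to the recognizer with <code>-jsgf</code>, provided every word in the phrases also appears in the pronunciation dictionary.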
 +
 
 +
Building a Language Model:
 +
*http://cmusphinx.sourceforge.net/wiki/tutoriallm
 +
 
 +
Generating a Dictionary:
 +
*http://cmusphinx.sourceforge.net/wiki/tutorialdict
 +
*from command line: <code> perl make_pronunciation.pl -tools /Users/rtwomey/code/cmusphinx/trunk/logios/Tools/ -dictdir /Users/rtwomey/code/cmusphinx/trunk/logios/Tools/MakeDict/lib/dict -words /Users/rtwomey/Documents/dxarts463_sp11/mmpi2_slm/mmpi2.tmp.vocab -handdict NONE -dict results.dic
 +
</code>
 +
*logios lexicon tool http://www.speech.cs.cmu.edu/tools/lextool.html
 +
*creating custom pronunciation dictionaries
 +
**using phonetisaurus: http://trulymadlywordly.blogspot.com/2011/05/using-freetts-and-phonetisaurus-for.html
 +
**using freeTTS: http://sourceforge.net/projects/cmusphinx/forums/forum/382337/topic/3582401?message=9115293
 +
*to build phonetisaurus
 +
*to build mitlm on OS X:
 +
**edit autogen.sh, change '''libtoolize''' to '''glibtoolize'''
 +
*requires liblbfgs.  to build liblbfgs:
 +
**<code>git clone git://github.com/chokkan/liblbfgs.git liblbfgs</code>
 +
**edit autogen.sh, change '''libtoolize''' to '''glibtoolize'''
 +
*FAILS HERE.  I NEED FORTRAN. (srsly?)
 +
 
 +
Language World:
 +
*items in the loft office http://svn.roberttwomey.com/supercollider/data/texts/language%20world%20-%20upstairs%20office.txt
 +
**as a java speech grammar: http://svn.roberttwomey.com/supercollider/grammars/languageworld.jsgf
 +
*items in the observatory http://svn.roberttwomey.com/supercollider/data/texts/observatory_desc.txt
 +
**trained language model http://svn.roberttwomey.com/supercollider/data/models/observatory/
 +
 
 +
Language Models:
 +
*script http://svn.roberttwomey.com/supercollider/data/models/script/
 +
*mmpi2 trained online http://svn.roberttwomey.com/supercollider/data/models/mmpi2_online/
 +
*mmpi2 trained locally (NO PRONUNCIATION DICTIONARY) http://svn.roberttwomey.com/supercollider/data/models/mmpi2_local/
 +
 
 +
Grammars:
 +
*language world http://svn.roberttwomey.com/supercollider/data/grammars/languageworld.jsgf
 +
*script (NON-FUNCTIONAL) http://svn.roberttwomey.com/supercollider/data/grammars/script.jsgf
 +
 
 +
== Final ==
 +
=== Tar File ===
 +
[http://wiki.roberttwomey.com/images/2/28/Portable_solipsist.tar portable_solipsist.tar]
 +
 
 +
contains supercollider code, os x binaries, data files, etc.
 +
see README inside for more info.
 +
 
 +
=== Report ===
 +
coming wed. am.
  
 
==Code==
 
*CMU Sphinx-4 Automatic Speech Recognition (ASR) library - http://sourceforge.net/projects/cmusphinx/files/
+
===Processing===
 
*Sphinx-4 wrapper for processing: http://svn.roberttwomey.com/processing/libraries/sphinx/
 
*Example code for Grammar-based recognition: http://svn.roberttwomey.com/processing/sphinxBochner/
+
*Example code for Grammar-based recognition: http://svn.roberttwomey.com/processing/sphinxGrammarTest/
 
*Example code for Statistical Language Model (SLM) based recognition: http://svn.roberttwomey.com/processing/sphinxSLMTest/
 
*I intend to use Supercollider as the central software for sound-input and processing.
+
 
 +
===Supercollider===
 +
*Speech Segmenter tool in supercollider http://svn.roberttwomey.com/supercollider/supercollider/speechtool.scd
 +
 
 +
===Pocketsphinx command-line Recognizer in OS X===
 +
*http://svn.roberttwomey.com/supercollider/pocketsphinx-osx/
 +
*http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/0.7/pocketsphinx-0.7.tar.gz/download
 +
 
 +
===Build a Language Model Online===
 +
*upload a list of sentences (i.e. text file) here: http://www.speech.cs.cmu.edu/tools/lmtool-new.html
 +
*get .dic and .lm files.
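
The upload step above wants a plain text file of sentences. A minimal Python sketch for preparing one from running prose follows; it assumes one sentence per line with punctuation stripped is an acceptable input, which I have not verified against the tool's own instructions.

```python
import re


def corpus_lines(text):
    """Split text into sentences and normalise them, one per line."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    lines = []
    for sentence in sentences:
        # Keep letters, digits, apostrophes and spaces only.
        cleaned = re.sub(r"[^A-Za-z0-9' ]+", " ", sentence)
        cleaned = " ".join(cleaned.split())
        if cleaned:
            lines.append(cleaned)
    return lines
```

Joining the result with newlines and saving it gives the text file to upload.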
 +
 
 +
===Build a Language Model Locally===
 +
*see week 10 and final week above... not yet done
 +
 
 +
===Secret Rabbit Code (libsamplerate)===
 +
*libsamplerate (used for sndfile-resample): http://www.mega-nerd.com/SRC/download.html
 +
*libsndfile: http://www.mega-nerd.com/libsndfile/#Download
 +
 
 +
== How To Compile on OS X ==
 +
===Download and Install Homebrew===
 +
from https://github.com/mxcl/homebrew/wiki/installation
 +
<pre> /usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)" </pre>
 +
===Install libsamplerate===
 +
<pre> brew install libsamplerate </pre>
 +
this should install libsndfile as well (as a dependency of libsamplerate).
 +
 
 +
===Install sphinxbase-0.7===
 +
*download sphinxbase-0.7: http://sourceforge.net/projects/cmusphinx/files/sphinxbase/0.7/sphinxbase-0.7.tar.gz/download
 +
*configure, make, install:
 +
<pre>
./configure --without-python
make
sudo make install
</pre>
 +
 
 +
===Install pocketsphinx-0.7===
 +
*download pocketsphinx-0.7: http://sourceforge.net/projects/cmusphinx/files/pocketsphinx/0.7/pocketsphinx-0.7.tar.gz/download
 +
*configure, make, install:
 +
<pre>
./configure --without-python
make
sudo make install
</pre>
 +
 
 +
===Build pocketsphinx-osx-continuous===
 +
*download: http://svn.roberttwomey.com/supercollider/pocketsphinx-osx/
 +
*build in xcode.
 +
*copy /build/Debug/pocketsphinx-osx-continuous to your binaries/ folder. this is the command line program used by the speechtool.scd program.
 +
== sphinx-openal on OS X ==
 +
*brew to install openal
 +
*build this - https://gitorious.org/code-dump/sphinx-openal
 +
 
 +
===OUT OF DATE: ===
 +
====Basic CMU Sphinx-4 Automatic Speech Recognition (ASR) library info ====
 +
*Reference Home http://cmusphinx.sourceforge.net/wiki/
 +
*download http://sourceforge.net/projects/cmusphinx/files/
 +
*to build 32 bit sphinx library, after you run <code>autogen.sh</code>, run <code> ./configure CFLAGS="-arch i386 -m32" LDFLAGS="-arch i386 -m32"</code>
 +
*training a SLM, http://www.speech.cs.cmu.edu/tools/lmtool-new.html
 +
 
 +
==== Compiling pocketsphinx as a universal static lib on OS X ====
 +
*make x86_64 version of libsphinxbase:
<code>
cd sphinxbase-0.7
./configure
make
</code>
*copy resulting <code>libsphinxbase.a</code> file from <code>/sphinxbase-0.7/src/libsphinxbase/.libs/</code> to <code>libsphinxbase.x86_64.a</code> in temp directory

*make x86_64 version of libpocketsphinx:
<code>
cd pocketsphinx-0.7
./configure
make
</code>
*copy resulting <code>libpocketsphinx.a</code> file from <code>/pocketsphinx-0.7/src/libpocketsphinx/.libs</code> to <code>libpocketsphinx.x86_64.a</code> file in temp directory

*make i386 version of libsphinxbase:
<code>
export CFLAGS="-arch i386"
export LDFLAGS="-arch i386"
cd sphinxbase-0.7
make clean
./configure
make
</code>
*copy resulting <code>libsphinxbase.a</code> file from <code>/sphinxbase-0.7/src/libsphinxbase/.libs/</code> to <code>libsphinxbase.i386.a</code> in temp directory

*make i386 version of libpocketsphinx (with the same CFLAGS and LDFLAGS still exported):
<code>
cd pocketsphinx-0.7
make clean
./configure
make
</code>
*copy resulting <code>libpocketsphinx.a</code> file from <code>/pocketsphinx-0.7/src/libpocketsphinx/.libs</code> to <code>libpocketsphinx.i386.a</code> file in temp directory

*combine files with lipo
<code>
lipo -create -output libsphinxbase.a libsphinxbase.x86_64.a libsphinxbase.i386.a
lipo -create -output libpocketsphinx.a libpocketsphinx.x86_64.a libpocketsphinx.i386.a
</code>
 +
=== Building logios LM tools ===
 +
<code>
export CFLAGS="-I/usr/include/malloc"
./configure
make
sudo make install
</code>
  
 
==References==
 
Line 115: Line 423:
 
**http://www.terminartors.com/artworkprofile/Pope_L._William-White_Room_4_Wittgenstein_my_Brother_Frank
**http://www.rovetv.net/wp3-index.html
 +
*The Cut-Up Method of Brion Gysin.  William S. Burroughs. http://www.ubu.com/papers/burroughs_gysin.html
 +
**Burroughs was greatly influenced by Gysin but he was able to take it one step further: He considered it only one of his writing tools along with newspaper articles, his dreams, seemingly random thoughts and any kind of word in his immediate vicinity. He took the scrambled text and reworked it until he felt it said something. Burroughs believed that he was not the writer but a transcriber of what was already written. It was no trouble if the words he wrote weren't his, since no words belonged to any writer, just like colours don't belong to painters. http://www.23degrees.net/cutup/
 +
*Charles O. Hartman. Virtual Muse: Experiments in Computer Poetry. http://www.amazon.com/Virtual-Muse-Experiments-Computer-Wesleyan/dp/0819522392/ref=ntt_at_ep_dpt_2 / http://www.upne.com/0-8195-2238-4.html
 +
*Brief History of the Oulipo. Jean Lescure. In ''New Media Reader'', Noah Wardrip-Fruin, Nick Montfort 2003.
 +
*Dennis Oppenheim. "Color Application for Chandra." 1971.
 +
** "My two-and-a-half-year-old daughter is taught seven basic colors by repeated exposure to projected light and to my voice. In three hours she is able to associate the color symbol with the word symbol, thereby acquiring this data. Individual tape loops of Chandra's voice repeating the color names are played twenty four hours a day to a parrot in a separate room. The parrot eventually learns to mimic the color names. Here, color is not directly applied to a surface, but transmitted (abstracted from its source) and used to structure the vocal responses of a bird. It becomes a method for me to throw my voice." (in Dennis Oppenheim: Selected works 1967-90. Heiss. 1992)

Latest revision as of 17:46, 10 November 2012

<<< back to Wiki Home

pocket sphinx by hand

my pocketsphinx:

~/supercollider/solipsist/binaries/pocketsphinx-osx-continuous \
-hmm ~/supercollider/solipsist/data/models/hmm/en_US/hub4wsj_sc_8k \
-dict ~/supercollider/solipsist/data/models/script/script.dic \
-lm ~/supercollider/solipsist/data/models/script/script.lm \
-infile robertrauschenberg1_rs.wav

default pocketsphinx:

pocketsphinx_continuous -hmm ~/supercollider/solipsist/data/models/hmm/en_US/hub4wsj_sc_8k \
-dict ~/supercollider/solipsist/data/models/script/script.dic \
-lm ~/supercollider/solipsist/data/models/script/script.lm \
-infile robertrauschenberg1_rs.wav

Proposal - Solipsist

I have been working with voice recognition technologies and Mel Bochner's text 'Serial Art, Systems, Solipsism', developing a device for performance and exchange between human and computer. The device consists of a microphone, speech recognition system, software, and a receipt printer. The format of the conversation is a dialog between human voice and printed receipts in text--the system transcribes (and validates) what it hears in terms of the words it knows. As is characteristic of voice recognition, face recognition, and other kinds of machine perception that operate within explicitly defined or statistically trained spaces of "perception", this is a solipsistic system, "denying the existence of anything outside the confines of its own mind" (Bochner, 1967). This character of the solipsist is one Mel Bochner evoked to describe the autonomy and denial of external reference in minimalist sculpture of the 1960s, but which I find particularly appropriate to describe current "smart" technologies--ultimately the agency of the system still comes down to whatever agency the programmers have embedded in it, a sort of ventriloquism. This idea of a closed, narrowly parameterized space of perception in the machine is an interesting model for (and contrast to) issues of language, vocabulary, and free expression in humans--an exploration I intend to pursue through this project.

There are multiple challenges in developing this piece as a performance/installation. The first and most concrete is to get a baseline speech recognition system working. I implemented this in the fall using the Sphinx-4 speech recognition library in Processing and Java. I have also acquired a receipt printer, ribbon, and paper, and can control its printing behavior through a serial interface. Details are in the Code section. This is the "proof of concept".

The development of roles for two characters in this piece--the system and the participant--is necessary to create the kind of encounter I have in mind. On the one hand, I would like this project to investigate the strengths and limitations of the speech recognition technology through viewer interaction, and on the other hand I would like to create a psychological investigation which highlights our human propensity to project psychology onto inanimate things and to attribute intention to them. Explicit attention in constructing the roles of both performers (human and machine) and in framing the situation will tease out some of the interesting ideas in both of these domains.

Finally, I need to address sonic properties of the piece and its time course as a composition. The voice of the viewer as they speak into the microphone is one sound source in the system, and the receipt printer has a very assertive (and retro) dot-matrix-ey sound as it prints out text and cuts rolls of paper. I need to make some decisions about how to use the sounds of the speaker and the printer over the course of the piece. Also, will I add in additional sound sources such as more printers, pre-recorded voices, voices of past participants, or processed sounds? There are additional possibilities here for rhythmic exchanges between the percussive sound of the printer and the speaker, for long pauses, silences, and repetitions. Additionally, I need to establish some overall arc for the piece--does an encounter with the system travel through to one pre-ordained conclusion? Are there multiple branching possibilities that change depending on what the viewer says and how they respond to the printouts? Finally there is a relationship to be explored between speech as text and speech as sound--a parallel to the roles of printing as text and printing as sound. The fundamental distinctions between text, sound, and speech as kinds of communication and expression can be ripe territory for exploration. I suspect that these conceptual and compositional questions will occupy most of my time this quarter and comprise the bulk of the work that I need to do.

Where else do we do this sort of projection and anthropomorphism? (projecting psychology or attributing intention to non-intelligent systems)

The most obvious technical challenges I foresee at this point are the implementation of sound input and pre-processing with supercollider, and interfacing from supercollider to the speech recognition library and receipt printer. As part of this project is a critical investigation of the strengths and limitations of automatic speech recognition (ASR) technology, I intend to get more involved with the internal mechanisms of speech recognition as implemented in the Sphinx-4 library. A more comprehensive understanding of that technology is necessary to figure out how to tweak it and expose its internal character and assumptions.

I will update the weekly Progress section as the quarter continues.

Equipment

  • Hardware:
    • receipt printer
    • microphone
    • MOTU box
    • Mac mini
    • desk*
    • desk lamp*
    • sound-proof commercial glazing (windows)*
  • Software:

*I do not have these items yet.

Timeline

Week 1 - 2

Introduction to course and project development.


Week 3 - Proposal - 4/12

  1. Write this.
  2. Meet with Juan.
  3. Find desk and desk lamp for "me vs. the computer" staging of microphone/speech recognition system.

Week 4

Work time.

Week 5 - MILESTONE 1

Working model of each of the two tracks:

  1. Participant speaking to computer.
  2. Computer/printer speaking to itself (feedback loop). Interpreting printer sounds as speech. Or transforming them into speech.

Week 6

Realize that the best approach will combine elements of each of the two tracks above.

Week 7 - MILESTONE 2

  • Experiments with the characterization of the system:
    • Software agent? agency. towards what goals?
    • Interruptions.
    • Unexpected responses.
    • Basic state modeling (emotional states, psychological states).
    • Basic drive modeling (for novelty, entertainment, activity, rest, conversation on certain topics).
  • Attention to possible choices of text.

I imagine these two go hand in hand--that the choice of particular texts will lend much of the character to the piece.

Week 8

Have others experience the system, try it out.

Week 9 - MILESTONE 3

Near-final form, near-final realization.

Viewer interaction tests.

Week 10

Final changes, improvements, last minute blitz.

Presentation

Progress

Week 2

get system running again

Week 3

  • Got the system running again. The hard-to-find OS X driver for my Airlink101 AC-USBS (Serial Adapter) was actually available from Prolific: http://www.prolific.com.tw/eng/downloads.asp?id=31. I think the Airlink device must have the Prolific PL-2303 USB to I/O Port Controller inside. Finding a driver probably wouldn't be difficult if I bought a more recently manufactured USB-to-serial adapter, if people still manufacture those sorts of things. I need a serial adapter because the Epson receipt printer has some old-school serial connectivity on the back.
  • Met with Juan.

Week 4

Week 5 - Milestone 1

Done:

  • working model
  • bidirectional OSC communication between processing/supercollider
  • some planning

Pictures (on my computer)

To Do - Conceptual:

  • end state? (criteria for cut)
  • changing identities?
  • learn from participants? (if so, what model/vocabulary am I matching against, e.g. WSJ4K, TI_DIGITS, custom vocab?)
  • bored behavior? (when no-one is in the room)

To Do - Technical:

  • folding camp/travel table.
  • speech classifier in supercollider
  • save out audio files of speech sound
  • responder. state modeling? what language?
  • sound isolation possibilities. multi-pane commercial glazing would be my preference, though it is expensive I'm sure.

Week 6

Pocketsphinx:

Language Models:

Sphinx Website Stuff:

Articles:

Week 7 - Milestone 2

In Class Discussion:

put a coin to purchase the receipt
occupying roles
master script: stelios
learning from its own output: nico
that's what I said: juan
more than just the filter: juan
some analysis of the text
reconfigure the text
what's the thing about the receipt printer?: nico
coin-op justifies the receipt.  commercial transaction
record of a transaction
oracle, coin-op oracle: nico
trimpin, coin-op sculpture: juan
automatic poetry generator
ask people to engage poetically with it, record them
the idea of the residual, the parts/things that people say that are outside of the system (its closed language-field)
site-specificity: administrative  scout out sites with speech rec system.
not be so one-dimensional, analyze tonality, intonation, rate.  profiling affect.

Week 8

Josh P. suggests Secret Rabbit Code (libsamplerate) to do my supercollider -> 16 kHz (sphinx) downsampling. The following three libraries can be built and installed:

  1. pkg-config 0.23 builds on OS X: http://mac.softpedia.com/progDownload/pkg-config-Download-46896.html
  2. libsndfile: http://www.mega-nerd.com/libsndfile/#Download
  3. libsamplerate: http://www.mega-nerd.com/SRC/download.html

Convert an audio file to 16-bit little-endian PCM (mono, 16 kHz):

  • ffmpeg -i solipsism1.wav -acodec pcm_s16le -ac 1 -ar 16000 solipsism1_pcm_s16le.wav
  • cmu07a.dic
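The ffmpeg conversion above can be applied to a whole directory of recordings. This is only a minimal sketch: the function names `pcm_name` and `convert_all` are hypothetical, and the ffmpeg options are copied from the command above.

```shell
# Derive the output name used above: solipsism1.wav -> solipsism1_pcm_s16le.wav
pcm_name() {
    printf '%s_pcm_s16le.wav\n' "${1%.wav}"
}

# Convert every .wav in a directory (default: current directory) to
# 16-bit little-endian PCM, mono, 16 kHz. Hypothetical helper.
convert_all() {
    dir="${1:-.}"
    for f in "$dir"/*.wav; do
        [ -e "$f" ] || continue                            # no .wav files at all
        case "$f" in *_pcm_s16le.wav) continue ;; esac     # already converted
        ffmpeg -i "$f" -acodec pcm_s16le -ac 1 -ar 16000 "$(pcm_name "$f")"
    done
}
```

Usage would be `convert_all ~/Sounds/SpeechFiles`, or `convert_all` with no argument for the current directory.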

Later in the week:

Week 9 - Milestone 3

To Do:

  • finish connecting up components (e.g. SC to pocketsphinx and back)
  • enable self-modifying grammar/Language Model
  • Updated pocketsphinx-osx code to accept command line arguments:
    • run recognition with a grammar: ./pocketsphinx-osx-continuous -jsgf solipsist.gram -infile solipsism1_pcm_s16le.wav
    • run recognition with a custom language model: ./pocketsphinx-osx-continuous -dict 9316.dic -lm 9316.lm
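For reference, a grammar file of the kind passed to -jsgf might look like the one written by the sketch below. The word list is a placeholder and not the project's actual solipsist.gram; `write_grammar` is a hypothetical helper. Each word in a grammar needs a matching entry in the pronunciation dictionary the recognizer loads.

```shell
# Write a tiny JSGF grammar to the given path (default: solipsist.gram).
# The rule accepts one or more words from a four-word placeholder vocabulary.
write_grammar() {
    out="${1:-solipsist.gram}"
    cat > "$out" <<'EOF'
#JSGF V1.0;
grammar solipsist;
public <utterance> = ( serial | art | systems | solipsism )+ ;
EOF
}

# Usage, with the -jsgf and -infile options shown above:
#   write_grammar solipsist.gram
#   ./pocketsphinx-osx-continuous -jsgf solipsist.gram -infile solipsism1_pcm_s16le.wav
```

A grammar like this constrains recognition to exactly the listed words, which makes the closed vocabulary of the system explicit in a way a statistical language model does not.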

Week 10

All right! I made much progress on my speechtool supercollider code. (http://svn.roberttwomey.com/supercollider/supercollider/speechtool.scd)

New features:

  • It is relatively sensitive to sound levels and does a decent job of detecting possible speech above background noise.
  • It correctly saves out raw time-stamped audio files to ~/Sounds/SpeechFiles/ of all sounds detected and sent to pocket-sphinx
  • It calls sndfile-resample to resample 44.1 kHz audio files to 16 kHz for pocket-sphinx.
  • It runs recognition with pocketsphinx and prints results to the post window in the format Recognized: ....
  • It saves a text logfile in ~/Sounds/SpeechFiles/logs with the DateStamped .wav file name and the recognized string.
  • It now works with a JSGF grammar rather than a statistical language model. Example command line options:
  • With an SLM from online tool:
    • ./pocketsphinx-osx-continuous -dict /Users/rtwomey/Documents/dxarts463_sp11/mmpi-2/9316.dic -lm /Users/rtwomey/Documents/dxarts463_sp11/mmpi-2/9316.lm
  • With an SLM from local tools:
    • ./pocketsphinx-osx-continuous -lm /Users/rtwomey/Documents/dxarts463_sp11/mmpi2_slm/mmpi2.arpa
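The resample, recognize, and log steps listed above can be sketched outside of SuperCollider as a small shell pipeline. This is not the speechtool.scd code. The `_16k` file naming, the tab-separated log format, and the assumption that the recognizer prints its hypothesis as the last line of stdout are all guesses made for illustration; the function names are hypothetical.

```shell
# Name of the 16 kHz copy: take_01.wav -> take_01_16k.wav (naming is an assumption)
resampled_name() {
    printf '%s_16k.wav\n' "${1%.wav}"
}

# One log line: file name, a tab, then the recognized string (format is an assumption)
log_line() {
    printf '%s\t%s\n' "$(basename "$1")" "$2"
}

# Resample one recording, run recognition on it, and append the result to a log.
# Assumes the recognizer writes its hypothesis to stdout as the final line.
recognize_file() {
    in="$1"
    logfile="$2"
    out="$(resampled_name "$in")"
    sndfile-resample -to 16000 "$in" "$out" || return 1
    hyp="$(./pocketsphinx-osx-continuous -jsgf solipsist.gram -infile "$out" | tail -n 1)"
    log_line "$in" "$hyp" >> "$logfile"
}
```

A call such as `recognize_file ~/Sounds/SpeechFiles/take_01.wav ~/Sounds/SpeechFiles/logs/log.txt` would then add one line per utterance to the log.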

Info on default models:

  • HMM (acoustic?) hub4wsj_sc_8k (hmm/en_US/hub4wsj_sc_8k)
  • Dictionary: cmu07a.dic, 133436 words (lm/en_US/cmu07a.dic)
  • Filler dictionary: noisedict (hmm/en_US/hub4wsj_sc_8k/noisedict)
  • LM... either hub4.5000.DMP or wsj0vp.5000.DMP, I am guessing. (?)

To Do:

  • Field recordings with baseline vocabulary and language models. (I think this is the WSJ5K)
  • Test with different LMs or Grammars
  • Make more decisions.

Finals Week

Rhyming + Homophonic sounds:

Building a Grammar:

Building a Language Model:

Generating a Dictionary:

  • http://cmusphinx.sourceforge.net/wiki/tutorialdict
  • from command line: perl make_pronunciation.pl -tools /Users/rtwomey/code/cmusphinx/trunk/logios/Tools/ -dictdir /Users/rtwomey/code/cmusphinx/trunk/logios/Tools/MakeDict/lib/dict -words /Users/rtwomey/Documents/dxarts463_sp11/mmpi2_slm/mmpi2.tmp.vocab -handdict NONE -dict results.dic

Language World:

Language Models:

Grammars:

Final

Tar File

portable_solipsist.tar

contains supercollider code, os x binaries, data files, etc. see README inside for more info.

Report

coming wed. am.

Code

Processing

Supercollider

Pocketsphinx command-line Recognizer in OS X

Build a Language Model Online

Build a Language Model Locally

  • see week 10 and final week above... not yet done

Secret Rabbit Code (libsamplerate)

How To Compile on OS X

Download and Install Homebrew

from https://github.com/mxcl/homebrew/wiki/installation

 /usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)" 

Install libsamplerate

 brew install libsamplerate 

this should install libsndfile as well (as a dependency of libsamplerate).

Install sphinxbase-0.7

./configure --without-python
make
sudo make install

Install pocketsphinx-0.7

./configure --without-python
make
sudo make install

Build pocketsphinx-osx-continuous

sphinx-openal on OS X

OUT OF DATE:

Basic CMU Sphinx-4 Automatic Speech Recognition (ASR) library info

Compiling pocketsphinx as a universal static lib on OS X

  • make x86_64 version of libsphinxbase:

cd sphinxbase-0.7
./configure
make

  • copy resulting libsphinxbase.a file from /sphinxbase-0.7/src/libsphinxbase/.libs/ to libsphinxbase.x86_64.a in temp directory
  • make x86_64 version of libpocketsphinx:

cd pocketsphinx-0.7
./configure
make

  • copy resulting libpocketsphinx.a file from /pocketsphinx-0.7/src/libpocketsphinx/.libs to libpocketsphinx.x86_64.a file in temp directory
  • make i386 versions of libsphinxbase:

export CFLAGS="-arch i386" 
export LDFLAGS="-arch i386"
cd sphinxbase-0.7
make clean
./configure
make

  • copy resulting libsphinxbase.a file from /sphinxbase-0.7/src/libsphinxbase/.libs/ to libsphinxbase.i386.a in temp directory
  • make i386 versions of libpocketsphinx:

cd pocketsphinx-0.7
make clean
./configure
make

  • copy resulting libpocketsphinx.a file from /pocketsphinx-0.7/src/libpocketsphinx/.libs to libpocketsphinx.i386.a file in temp directory
  • combine files with lipo

lipo -create -output libsphinxbase.a libsphinxbase.x86_64.a libsphinxbase.i386.a
lipo -create -output libpocketsphinx.a libpocketsphinx.x86_64.a libpocketsphinx.i386.a
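The combine step can be wrapped so that each universal library is checked right after it is created. A minimal sketch, run from the temp directory that holds the per-architecture files; `thin_name` and `combine` are hypothetical helper names, while `lipo -create` and `lipo -info` are the standard OS X commands.

```shell
# Per-architecture file name, matching the names used in the steps above:
#   thin_name sphinxbase i386 -> libsphinxbase.i386.a
thin_name() {
    printf 'lib%s.%s.a\n' "$1" "$2"
}

# Combine the two thin libraries in the current directory, then list the
# architectures that the resulting universal library contains.
combine() {
    lipo -create -output "lib$1.a" \
        "$(thin_name "$1" x86_64)" "$(thin_name "$1" i386)" &&
        lipo -info "lib$1.a"
}

# Usage, from the temp directory:
#   combine sphinxbase
#   combine pocketsphinx
```

If `lipo -info` reports only one architecture, the most likely cause is that one of the two builds was copied before `make clean` and `./configure` were rerun with the new flags.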

Building logios LM tools

export CFLAGS="-I/usr/include/malloc"
./configure
make
sudo make install

References