Beyond the Every Day: Vocal Potential in AI Mediated Communication
In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneer Works in Brooklyn, NY. Diedrick, a 2021 Mozilla Creative Media Award recipient and creator of such works as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series (along with ed-in-chief JS!). It starts today with Amina Abbas-Nazari, who helps us understand how speech AI systems operate from a very limiting set of assumptions about the human voice: are we training it, or is it actually training us?
“Hi, good morning. I’m calling in from Bangalore, India.” I’m talking on speakerphone to a man with an obvious Indian accent. He pauses. “Now I have enabled the accent translation,” he says. It’s the same person, but he sounds completely different: loud and slightly nasal, impossible to distinguish from the accents of my friends in Brooklyn.
“The AI startup erasing call center worker accents: is it fighting bias – or perpetuating it?” (Wilfred Chan, 24 August 2022)
This telephone interaction was recounted in The Guardian’s reporting on a Silicon Valley tech start-up called Sanas. The company provides AI enabled technology that modifies call centre workers’ voices in real time to sound more “Western”. The company describes this venture as a solution to improve communication between typically American callers and call centre workers, who might be based in countries such as the Philippines and India. Meanwhile, research has found that major companies’ AI interactive speech systems exhibit considerable racial imbalance when trying to recognise Black voices compared to white speakers. As a result, in the hopes of being better heard and understood, Google smart speaker users with regional or ethnic American accents relay that they find themselves contorting their mouths to imitate Midwestern American accents.
These instances describe racial biases present in voice interactions with AI enabled and mediated communication systems, whereby sounding ‘Western’ entitles one to more efficient communication, better usability, or increased access to services. This is not a problem specific to AI though. Linguistics researcher John Baugh, writing in 2002, describes how linguistic profiling is known to have resulted in housing being denied to people of colour in the US via telephone interactions. Jennifer Stoever’s The Sonic Color Line (2016) presents a cultural and political history of the racialized body and how it both informed and was informed by emergent sound technologies. AI mediated communication repeats and reinforces biases that pre-exist the technology itself, while also helping them become even more widely pervasive.
Mozilla’s commendable Common Voice project aims to ‘teach machines how real people speak’ by building an open source, multi-language dataset of voices to improve usability for non-Western speaking or sounding voices. But in The Race of Sound (2019), singer and musicologist Nina Sun Eidsheim describes how ‘a specific voice’s sonic potentiality [in] its execution can exceed imagination’ (7), and describes voices as having ‘an infinity of unrealised manifestations’ (8). Eidsheim’s sentiments describe a vocal potential, through musicality, that exists beyond ideas of accents and dialects, and vocal markers of categorised identity. As a practicing vocal performer, I recognise and resonate with Eidsheim’s ideas. I have a particular interest in extended and experimental vocality, especially gained through my time singing with Musarc Choir and working with artist Fani Parali. In these instances, I have experienced the pleasurable challenge of being asked to vocalise the mythical, animal, imagined, alien and otherworldly edges of the sonic sphere, to explore complex relations between bodies, ecologies, space and time, illuminated through vocal expression.
Following from Eidsheim, and through my own vocal practice, I believe AI’s prerequisite of voices as “fixed, extractable, and measurable ‘sound object[s]’ located within the body” is over-simplistic and reductive. Voices, within systems of AI, are made to seem only as computable delineations of person, personality and identity, constrained to standardised stereotypes. By highlighting vocal potential, I offer a unique critique of the way voices are currently comprehended in AI recognition systems. When we appreciate the voice beyond the homogenous, we give it authority and autonomy, ultimately leading to a fuller understanding of the voice and its sounding capabilities.
My current PhD research, Speculative Voicing, applies thinking about the voice from a musical perspective to the sound and sounding of voices in artificially intelligent conversational systems. Hereby the voice becomes an instrument of the body to explore its sonic materiality, vocal potential and extremities of expression, rather than being comprehended in conjunction with vocal markers of identity aligning to categories of race, gender, age, etc. In turn, this opens space for the voice to be understood as a shapeshifting, morphing and malleable entity, with immense sounding potential beyond what might be considered ordinary or everyday speech. Over the long term this provides discussion of how experimenting with vocal potential may illuminate more diverse perspectives about our sense of self and being in relation to vocal sounding.
Vocal and movement artist Elaine Mitchener perfectly dispels the illusion of the voice as ‘fixed’ in her performance of Christian Marclay’s No!, which I attended one hot summer’s evening at the London Contemporary Music Festival in 2022. Marclay’s graphic score uses cut outs from comic book strips to direct the performer to vocalise a myriad of ‘No’s.
Mitchener’s rendering of the piece involved the cooperation and coordination of her entire body, carefully crafting lips, teeth, tongue, muscles and ligaments to construct each iteration of ‘No.’ Each transmutation of Mitchener’s ‘No’s came with a distinct meaning, context, and significance, contained within the vocalisation of this one simple syllable. Every utterance explored a new vocal potential, enabled by her body alone. In the context of AI mediated communication, we can see that this way of working with the voice renders the idea of the voice as ‘fixed’ redundant. Mitchener’s vocal potential demonstrates that voices can and do exist beyond AI’s prescribed comprehension of vocal sounding.
In order to further understand how AI transcribes understandings of voice onto notions of identity, and vocal potential, I produced the practice project Polyphonic Embodiment(s) as part of my PhD research, in collaboration with Nestor Pestana, with AI development by Sitraka Rakotoniaina. The AI we created for this project is based upon a speech-to-face recognition AI that aims to be able to tell what your face looks like from the sound of your voice. The prospective impact of this AI is deeply unsettling, as its intended applications are wide-ranging, from entertainment to security, and, as previously described, AI recognition systems are inherently biased.
This multi-modal form of comprehending voice is also a hot topic of research being conducted by major research institutions including Oxford University and Massachusetts Institute of Technology. We wanted to explore this AI recognition programme in conjunction with an understanding of vocal potential and the voice as a sonic material shaped by the body. As the project title suggests, the work invites people to consider the multi-dimensional nature of voice and vocal identity from an embodied standpoint. Additionally, it calls for contemplation of the relationships between voice and identity, and individuals having multiple or evolving versions of identity. The collaboration with the custom-made AI software creates a feedback loop to reflect on how people’s vocal sounding is “seen” by AI, to contest the way voices are currently heard, comprehended and utilised by AI, and indeed the AI industry.
The video documentation for this project shows ‘facial’ images produced by the voice-to-face recognition AI, when activated by my voice, modified with simple DIY voice devices. Each new voice variation, created by each device, produces a different outputted face image. Some images perhaps resemble my face (e.g. Device #8), some might be considered more masculine (e.g. Device #10), and some are just disconcerting (e.g. Device #4). The speculative nature of Polyphonic Embodiment(s) is not to suggest that people should modify their voices in interaction with AI communication systems. Rather, the simple devices work with bodily architecture and exaggerate its materiality, considering it as a flexible instrument to explore vocal potential. In turn this sheds light on the normative assumptions contained within AI’s readings of voice and its relationships to facial image and identity construction.
Through this artistic, practice-led research I hope to evolve and augment discussion around how the sounding of voices is comprehended by different disciplines of research. Taking a standpoint from music and design practice, I believe this can contest ways of working in the realms of AI mediated communication and shape the ways we understand notions of (vocal) identity: as complex, fluid, malleable, and ultimately not reducible to Western logics of sounding.
Featured Image: Still image from Polyphonic Embodiment(s), courtesy of the author.
Amina Abbas-Nazari is a practicing speculative designer, researcher, and vocal performer. Amina has researched the voice in conjunction with emerging technology, through practice, since 2008 and is now completing a PhD in the School of Communication at the Royal College of Art, focusing on the sound and sounding of voices in artificially intelligent conversational systems. She has presented her work at the London Design Festival, Design Museum, Barbican Centre, V&A, Milan Furniture Fair, Venice Architecture Biennial, Critical Media Lab, Switzerland, Litost Gallery, Prague and Harvard University, America. She has performed internationally with choirs and regularly collaborates with artists as an experimental vocalist.
REWIND! . . .If you liked this post, you may also dig:
What is a Voice?–Alexis Deighton MacIntyre
Mr. and Mrs. Talking Machine: The Euphonia, the Phonograph, and the Gendering of Nineteenth Century Mechanical Speech – J. Martin Vest
One Scream is All it Takes: Voice Activated Personal Safety, Audio Surveillance, and Gender Violence—María Edurne Zuazu
Echo and the Chorus of Female Machines—AO Roberts
On Sound and Pleasure: Meditations on the Human Voice– Yvon Bonenfant
The Sound of What Becomes Possible: Language Politics and Jesse Chun’s 술래 SULLAE (2020)
“To this day I think about all the strange words I missed out on, all the losses I’m still carrying from faraway… I still think of the time when I spoke one language, and that language was whole.” (Chun 2020)
Language can be a site of loss, a wholeness that one, due to migration, has never really known. In the above passage, artist Jesse Chun reflects on how her grandmother spoke words in a language she did not understand, sounds she yearned to hear and feel after her grandmother’s passing. There is a sonic residue that sticks to diasporic experiences. There are sounds that can stir up a blend of affect and ideation that is comforting when whiteness is unsettling. It is this disjuncture between words, meaning, and their sounds that drew me to Chun’s work, 술래 SULLAE (2020). This piece reminded me of how sound, in its most ambiguous and queer forms, can hold the contingencies of history, language, memory, family, and the genealogies of loss that mark these sites of colonial dispossession.
술래 SULLAE (2020) is a single-channel video that draws from ganggang sullae, a Korean seasonal harvest and fertility ritual that integrates song and dance and is typically performed by women under the glow of moonlight. The participants hold hands, forming a circle that, through their movement, expands, disassembles, and changes its form. The songs can be either impromptu or pre-determined and encourage the participants to express their feelings in chorus with one another.
Diana Seo Hyung Lee (2020) suggests that historically ganggang sullae was meant to provide a forum for its participants to express emotions connected to living within patriarchal systems of power and oppression. She writes: “the women participating would not have been able to, in their everyday lives, sing, speak loudly, nor leave the house at night, in the patriarchal society of ancient Korea. This dance was a license for their one release.” In 술래 SULLAE, the dance proves to be a defiant presence. The women flash on screen as an unbreakable chain reinscribing a gendered history with new sounds and images that gesture to emancipatory possibilities.
술래 SULLAE combines archival clips of ganggang sullae, index pages from intonation books, images of Hangul and English consonants, and audio splices from YouTube tutorials on how to pronounce English correctly. In the video, language becomes unhinged from expectation but, at the same time, is given form through history. The sound of the English language is disembodied and spliced into phonemic pulses. In 술래 SULLAE, Chun has created an encounter with the grammars of polyphony; a simultaneity of sounds that are both restrained by and resistant to the imposition of English on the Korean diaspora. Through what Chun has described as a form of “unlanguaging” following Rey Chow, her audience is witness to new meanings produced through the abstraction, manipulation, and redaction of sounds and symbols from the English language.
Chun’s editing and manipulation of English sounds is intentional. In an interview with Artforum, Chun shares: “Taking the sound apart but still keeping it within the conceptual framework of English made me think about what else is embedded in making a language. English is tied up with legacies of imperialism; there’s so much unseen violence that is part of how this language is institutionalized.” What remains after the edits is an inventory of sounds that disrupts the primacy of the vowel as central to English word construction and thus, central to colonial imagination.
Like Chun, I realize that my conceptualizing of language is within an English framework, but my hope is that when we turn to the affective and when we begin to pull language apart, something different, something resistant, is produced. I am an expert in neither English nor Korean linguistics; it was the sounds in this work that pulled me into it. In thinking with 술래 SULLAE, I’m interested in what becomes possible in the absence of the vowel. I turn to the interruptive potential of consonant sounds to affect and incite methods of communication outside of those steeped in colonial dominance. What does it mean to de-emphasize the function of vowel sounds in language and reorient our listening to the consonant? What do consonant sounds teach us about the sonics of race that underwrite hierarchies of language? What methods of communication become possible when we do away with words and are left with only their sonic substance?
Through her assemblage of consonant sounds in 술래 SULLAE, Chun is making a deliberate choice to describe and animate a politics of language through refusing its colonial enclosures and turning to the aesthetic in order to hold the excesses of description. She refuses the vowel in this piece, not by denying its presence, but by instead relegating it to the soundless and the unfamiliar, a space of, in her words, “untranslatability.” In this undoing, consonants become the emotive force where new meanings and orientations to the sounds that mark our words are forged.
술래 SULLAE opens with the sound ssshhh; a pairing of consonant sounds that is often associated with insisting on silence, a sound meant to reprimand. Chun extracts and emplaces this sound in a new aesthetic landscape that is independent of and unregulated by colonial schemas of enunciation and translation. The prominent soundscapes of the video are consonant sounds, and when removed from their phonetic relations to vowels these sounds undo the presumptive structuring or potential reprimand of English. In 술래 SULLAE, we are meant to experience the fullness of the consonants’ timbre… ssshhh, ppp, ddd, tttt, kkkk… these edited clips of sound, originally meant to instruct and assimilate speech into English pronunciations, now serve a different function. For me, they secure Chun’s political orientation: one that is about the crafting of a world that involves the careful consideration of the logistics, function, and embedded emotions of the sounds that inhabit it.
All languages contain their own unique set of vowels and consonants, but Anne Carson reminds us that: “The importance of vowels to human speech has remained. There are words in English without consonants, but so central are vowels to word construction that there isn’t a word in English that doesn’t include a vowel.” In speech, consonant sounds are meant to break up the intended agenda of vowels. The ssshhh, ppp, ddd, tttt, kkkk, are antithetical to the circle or the rounded mouth needed to voice a vowel sound. Unlike the openness of a vowel, producing consonant sounds involves a narrowing of the vocal tract. This narrowing is referred to as constriction or the obstruction of breath, whereby sound is produced by a form of corporeal tension. Consonant sounds also demand all the mechanics of the mouth: the lips, the teeth, the tongue, and the palate. Shhhh requires the corners of the lips to lower and, rather than rounding, the lips become pursed, and teeth become exposed. Parts of the mouth are drawn in. The soft palate is raised, and the tongue reaches upwards towards the roof of the mouth without touching it and then the tip of the tongue lowers behind the teeth.
Consonants emerge out of collectivity. Where a vowel is sounded without vocal constraint, consonants require more effort. Their sounds are produced through intricate bodily choreographies in the mouth that involve both constriction and collaboration. Ganggang sullae likewise relies on effort and interdependence. Participants collectively determine the speed and/or shape of their dance. They may even become serpentine or separate into smaller circles depending on what the group decides. The dance also provides an aesthetic space for its participants to voice frustration, anger, and tension through song with the hopes of producing reprieve from gendered hardships. Chun has decided to withhold these songs from her audience; we never hear the women singing. Through this erasure, Chun embeds the consonant sound with affective force whereby a politics of language and gendered presence is enunciated through and beyond a form of silencing. The dance redirects trajectories of dominance whereby the shushing takes on a new voice imbued with agency and hope. Because of how Chun isolates and amplifies its sound, ssshhhh is free to take on different meanings and associations. I was reminded of rushing water or gusts of wind, or the sound used to lull my child to sleep. I was brought into another index of knowing and relating.
The sounds of language hold erasures and layered histories often obfuscated by our mundane encounters with them. Largely understood as the most sonorous part of the syllable, vowels produce the loudest speech sounds, and their capacity for holding larger amplitudes or louder volumes has been linked to the sonic expression of emotion. Consonant sounds are more pragmatic than vowels. They are known for their functionality, for the ways in which they assemble the semantic structure of words, and for their capacities to hold vowels in place or, as Anne Carson describes, for “delineating meaning amid the flow of open vowel sounds.” Consonant and vowel sounds map out different functional trajectories by virtue of the shape of mouth and orientation of breath that these sounds demand. Like Chun, I’m interested in what political orientations become possible when we source emotion elsewhere, beyond the confines of spoken words imposed upon us.
The word consonant is a noun, a word used to identify or classify, a semantic enclosure that establishes a subject or object. But unlike the word vowel, consonant is also an adjective. A consonant possesses the capacity to describe, to name, to tell us more. Adjectives parcel out description on states of being; in this way, they are inherently phenomenological. In 술래 SULLAE, Chun empties vowels of their sonic substance, leaving behind traces of fragmented characters and differently shaped circles in their wake. They are stripped of breath and their symbolic value, forming a new method of communication that reroutes expectations of what language, as we know it, can do and sound like. Like ganggang sullae, the vowel is premised on the shape of a circle, but in 술래 SULLAE, Chun provokes us to think about what becomes possible beyond the circular structuring device, what becomes possible beyond the purview of the violent embeddedness of English and its colonial exigencies.
Chun has noted that the moon that hovers above the ganggang sullae is yet another site of imperial conquest. In Artforum, Chun states: “when I look up at it to feel comforted or to find solace, I’m reminded of colonial violence and an agenda that’s projected onto it. In that way, the moon also reflected how I see language.” Chun’s turn to consonants signals a reshaping of the colonial frame that does not disavow or idealize the legacies of imperialism on systems of communication, but instead highlights the tensions and obstructions produced in its shadows.
Featured image: 술래 SULLAE, 2020, single-channel version, courtesy of artist
Casey Mecija is an Assistant Professor in the Department of Communication & Media Studies at York University. Her current research examines sound as a mode of affective, psychic and social representation, specifically in relation to diasporic experience. Drawing on sound studies, queer diaspora studies and Filipinx Studies, her research considers how sensorial encounters are enmeshed and disciplined by social and psychic conditions. She is also a musician and filmmaker, whose work has received a number of accolades and has been presented internationally.
REWIND! . . .If you liked this post, you may also dig:
Blank Space and “Asymmetries of Childhood Innocence” –Casey Mecija
Re-orienting Sound Studies’ Aural Fixation: Christine Sun Kim’s “Subjective Loudness”–Sarah Mayberry Scott
Tape Hiss, Compression, and the Stubborn Materiality of Sonic Diaspora–Chris Chien