I first heard about voice donation while listening to “Being Siri,” an experimental audio piece about Erin Anderson donating her voice to the Boston-based voice donation company VocaliD. Like a digital blood bank of sorts, VocaliD provides a platform for donating one’s voice via digital audio recordings. These recordings are used to help technicians create a custom digital voice for a voiceless individual, providing an alternative to the predominately white, male, mechanical-sounding assistive technologies used by people who cannot vocalize for themselves (think Stephen Hawking). VocaliD manufactures voices that better match a person’s race, gender, ethnicity, age, and unique personality. To me, VocaliD encapsulates the promise, complexity, and problematic nature of our current speech AI landscape and serves as an example of why we need to think critically about sound technologies, even when they appear to be wholly beneficial.
Given the extreme lack of sonic diversity in vocal assistive technologies, VocaliD provides a critically important service. But a closer look at both the rhetoric used by the organization and the material process involved in voice donation also amplifies the limits of overly simplistic, human-centric conceptions of voice. For instance, VocaliD rhetorically frames their service by persistently linking voice to humanity—to self, authenticity, individuality. Consider the following statements made by Rupal Patel, CEO and founder of VocaliD, in which she emphasizes the need for voice donation technology:
These are just a few examples from a larger discourse that reinforces the connection between voice and humanity. VocaliD’s repeated claims that their unique vocal identities humanize individuals imply that one is not fully human unless one’s voice sounds human. This rhetoric positions voiceless individuals as less than human (at least until they pay for a customized human-sounding voice).
VocaliD’s conflation of voice and humanity makes me wonder about the meaning of “human” in this context. For example, notions of humanity have been historically associated with Western whiteness—and deployed as a means of separating or distinguishing white people from Others—as Alexander Weheliye points out. Though VocaliD’s mission is to diversify manufactured voices, is a “human-sounding” voice still construed as a white voice? Does sounding human mean sounding white? Even if there is a bank of sonically diverse voices to choose from, does racial bias show up in the pacing, phrasing, or inflection caused by the vocal technology?
I am also disturbed by the rhetoric of humanity and individuality used by VocaliD because the company adopts the same rhetoric to describe the AI voices they sell to brands for media and smart products. Here’s an example of this rhetoric from the VocaliD AI website: “When you need a voice that resonates, evokes audience empathy, and sounds like you, rather than your competitors, VocaliD’s AI-powered vocal persona is the solution. Your voice — always on, where you need it when you need it.” Using similar rhetorical strategies to describe both voiceless people and products is dehumanizing. And yet, having a more diverse AI vocal mediascape, especially in terms of race, is crucially important since voice-activated machines and products are designed largely by white men who end up reinforcing the sonic color line.
Interestingly, the processes VocaliD uses to create a custom voice reveal that these voices are not, in fact, unique markers of humanity or individuality. It’s hard to find a detailed account of how VocaliD voices are made due to the company’s patents, but here are the basics: VocaliD does not transfer a donated voice directly to a voiceless person’s assistive technology. VocaliD technicians instead blend and digitally manipulate the donated voice with recordings of the noises a voiceless person can make (a laugh, a hum) to create a distinct new voice for the recipient. In other words, donated voices are skillful remixes that wouldn’t be possible without extracting vocal data and manipulating it with digital tools. Despite perpetuating narratives about voice, humanity, and authenticity, VocaliD’s creative blending of vocal material reveals that donated voices are the result of compositional processes that involve much more than people.
Further, considering VocaliD voices from a material rather than human-centric perspective amplifies something important about voices in general. All voices are composed of and grounded in an ecology. That is, voices emerge and are developed through a mixture of: (1) biological makeup (or technological makeup in the case of machines with voices); (2) specific environments and contexts (geography may determine the kind of accents humans have; AI voices have distinct sounds for their brands); (3) technologies (phones, computers, digital recorders and editors, software, and assistive technologies preserve, circulate, and amplify voices); and (4) others (humans often emulate the vocal patterns of the people they interact with most; many machine voices also sound like other machine voices). Put simply, all voices are intentionally and unintentionally composed over time—shaped by ever-changing bodily (and/or technological) states and engagements with the world. Voices are dynamic compositions by nature. Examining voice from a material standpoint shows that voices are not static markers of humanity; voices are responsive and malleable because they are the result of a complex ecology that involves much more than a “unique” human being.
However, focusing solely on the material aspects of vocality leaves out people’s lived experiences of voice. And based on online videos of VocaliD recipients—like Delaney, a seventeen-year-old with cerebral palsy—VocaliD voices seem to live up to the company’s hype. Delaney appears delighted by her new voice, stating: “I was so excited to get my own voice. I used to have a computer voice and now I sound like a girl. I like that. And I talk more.” Delaney’s teachers also discuss how her new voice completely changed her demeanor. Whereas before Delaney was reluctant to use her assistive technology to speak, her new voice gives her confidence and a stronger sense of identity. As her teacher explains in the video, “she is really engaged in groups, she wants to share her answers, she’s excited to talk with friends. It’s been really nice to see.” For Delaney, a VocaliD voice represents a newfound sense of agency.
It’s important to recognize this video is not necessarily representative of every VocaliD recipient’s experience, or even Delaney’s full experience. As Meryl Alper notes in Giving Voice, these types of news stories “portray technology as allowing individuals to ‘overcome’ their disability as an individual limitation, and are intended to be uplifting and inspirational for able-bodied audiences” (27). While we should be wary of the technological determinism in the video, observing Delaney use her VocaliD voice—and listening to the emotional responses of her mom and teachers—makes it difficult to deny that donated voices make a positive impact. For me, this video also gets at a larger truth about humans and voice: the ways we hear and understand our own voices, and the ways others interpret the sounds of our voices, matter a great deal. Voices are integral to our identities—to the ways we understand and think about ourselves and others—and the sounds of our voices have social and material consequences, as the SO! Gendered Voices Forum illustrates so clearly.
It’s worth repeating that VocaliD’s mission to diversify synthetic voices is incredibly important, especially given the restrictive vocal options available to voiceless individuals. It’s also necessary to acknowledge the company has limitations that end up reproducing the structural inequities it tries to address. As Alper observes, “In order to become a speech donor, one must have three to four hours of spare time to record their speech, access to a steady and strong Internet connection, and a quiet location in which to record” (162-63). With these obstacles to donating one’s voice in mind, it’s not surprising that all the VocaliD recipient videos I could find feature white people. Donating one’s voice is much easier for middle- to upper-class white people who have access to privacy, Internet, and leisure time.
This brief examination of VocaliD raises questions about what a more equitable future for vocal technologies might look/sound like. Though I don’t have the answer, I believe that to understand the fullness of voice, we can’t look at it from a single perspective. We need to account for the entire vocal ecology: the material (biological, technological, financial, etc.) conditions from which a voice emerges or is performed, and individual speakers’ understanding of their culture, race, ethnicity, gender, class, ability, sexuality, etc. An ecological approach to voice involves collaborating with people and their vocal needs and desires—something VocaliD models already. But it also involves accounting for material realities: How might we make the barriers preventing a more diverse voice ecosystem less difficult to navigate—especially for underrepresented groups? In short, we must treat voice holistically. Voices are more than people, more than technologies, more than contexts, more than sounds. Understanding voice means acknowledging the interconnectedness of these things and how that interconnectedness enables or precludes vocal possibilities.
Featured image: 366-350 You can’t shut me up, Jennifer Moo, CC BY-ND
Steph Ceraso is an associate professor of digital writing and rhetoric at the University of Virginia. Her 2018 book, Sounding Composition: Multimodal Pedagogies for Embodied Listening, proposes an expansive approach to teaching with sound in the composition classroom. She also published a digital book in 2019 called Sound Never Tasted So Good: ‘Teaching’ Sensory Rhetorics—an exploration of writing, sound, rhetoric, and food. She is currently working on a book project that examines sonic forms of invention in various contexts.
“To this day I think about all the strange words I missed out on, all the losses I’m still carrying from faraway…I still think of the time when I spoke one language, and that language was whole.” (Chun 2020)
Language can be a site of loss, a wholeness that one, due to migration, has never really known. In the above passage, artist Jesse Chun reflects on how her grandmother spoke words in a language she did not understand, and on how she yearned to hear and feel those sounds after her grandmother’s passing. There is a sonic residue that sticks to diasporic experiences. There are sounds that can stir up a blend of affect and ideation that is comforting when whiteness is unsettling. It is this disjuncture between words, meaning, and their sounds that drew me to Chun’s work, 술래 SULLAE (2020). This piece reminded me of how sound, in its most ambiguous and queer forms, can hold the contingencies of history, language, memory, family, and the genealogies of loss that mark these sites of colonial dispossession.
술래 SULLAE (2020) is a single-channel video that draws from ganggang sullae, a Korean seasonal harvest and fertility ritual that integrates song and dance and is typically performed by women under the glow of moonlight. The participants hold hands, forming a circle that, through their movement, expands, disassembles, and changes its form. The songs can be either impromptu or predetermined, and they encourage the participants to express their feelings in chorus with one another.
Diana Seo Hyung Lee (2020) suggests that historically ganggang sullae was meant to provide a forum for its participants to express emotions connected to living within patriarchal systems of power and oppression. She writes: “the women participating would not have been able to, in their everyday lives, sing, speak loudly, nor leave the house at night, in the patriarchal society of ancient Korea. This dance was a license for their one release.” In 술래 SULLAE, the dance proves to be a defiant presence. The women flash on screen as an unbreakable chain reinscribing a gendered history with new sounds and images that gesture to emancipatory possibilities.
술래 SULLAE combines archival clips of ganggang sullae, index pages from intonation books, images of Hangul and English consonants, and audio splices from YouTube tutorials on how to pronounce English correctly. In the video, language becomes unhinged from expectation but, at the same time, given form through history. The sound of the English language is disembodied and spliced into phonemic pulses. In 술래 SULLAE, Chun has created an encounter with the grammars of polyphony: a simultaneity of sounds that are both restrained by and resistant to the imposition of English on the Korean diaspora. Through what Chun has described as a form of “unlanguaging” following Rey Chow, her audience is witness to new meanings produced through the abstraction, manipulation, and redaction of sounds and symbols from the English language.
Chun’s editing and manipulation of English sounds is intentional. In an interview with Art Forum, Chun shares: “Taking the sound apart but still keeping it within the conceptual framework of English made me think about what else is embedded in making a language. English is tied up with legacies of imperialism; there’s so much unseen violence that is part of how this language is institutionalized.” What remains after the edits is an inventory of sounds that disrupts the primacy of the vowel as central to English word construction and thus, central to colonial imagination.
Like Chun, I realize that my conceptualizing of language is within an English framework, but my hope is that when we turn to the affective and when we begin to pull language apart, something different, something resistant, is produced. I am an expert in neither English nor Korean linguistics; it was the sounds in this work that pulled me into it. In thinking with 술래 SULLAE, I’m interested in what becomes possible in the absence of the vowel. I turn to the interruptive potential of consonant sounds to affect and incite methods of communication outside of those steeped in colonial dominance. What does it mean to de-emphasize the function of vowel sounds in language and reorient our listening to the consonant? What do consonant sounds teach us about the sonics of race that underwrite hierarchies of language? What methods of communication become possible when we do away with words and are left with only their sonic substance?
Through her assemblage of consonant sounds in 술래 SULLAE, Chun is making a deliberate choice to describe and animate a politics of language through refusing its colonial enclosures and turning to the aesthetic in order to hold the excesses of description. She refuses the vowel in this piece not by denying its presence but by relegating it to the soundless and the unfamiliar, a space of, in her words, “untranslatability.” In this undoing, consonants become the emotive force where new meanings and orientations to the sounds that mark our words are forged.
술래 SULLAE opens with the sound ssshhh, a pairing of consonant sounds that is often associated with insisting on silence, a sound meant to reprimand. Chun extracts and emplaces this sound in a new aesthetic landscape that is independent of and unregulated by colonial schemas of enunciation and translation. The prominent soundscapes of the video are consonant sounds, and when removed from their phonetic relations to vowels, these sounds undo the presumptive structuring or potential reprimand of English. In 술래 SULLAE, we are meant to experience the fullness of the consonants’ timbre…ssshhh, ppp, ddd, tttt, kkkk… These edited clips of sound, originally meant to instruct and assimilate speech into English pronunciations, now serve a different function. For me, they secure Chun’s political orientation: one that is about the crafting of a world that involves the careful consideration of the logistics, function, and embedded emotions of the sounds that inhabit it.
All languages contain their own unique set of vowels and consonants, but Anne Carson reminds us that “The importance of vowels to human speech has remained. There are words in English without consonants, but so central are vowels to word construction that there isn’t a word in English that doesn’t include a vowel.” In speech, consonant sounds are meant to break up the intended agenda of vowels. The ssshhh, ppp, ddd, tttt, kkkk are antithetical to the circle or the rounded mouth needed to voice a vowel sound. Unlike the openness of a vowel, producing consonant sounds involves a narrowing of the vocal tract. This narrowing is referred to as constriction or the obstruction of breath, whereby sound is produced by a form of corporeal tension. Consonant sounds also demand all the mechanics of the mouth: the lips, the teeth, the tongue, and the palate. Shhhh requires the corners of the lips to lower and, rather than rounding, the lips become pursed and teeth become exposed. Parts of the mouth are drawn in. The soft palate is raised, and the tongue reaches upwards towards the roof of the mouth without touching it, and then the tip of the tongue lowers behind the teeth.
Consonants emerge out of collectivity. Where a vowel is sounded without vocal constraint, consonants require more effort. Their sounds are produced through intricate bodily choreographies in the mouth that involve both constriction and collaboration. Ganggang sullae likewise relies on effort and interdependence. Participants collectively determine the speed and/or shape of their dance. They may even become serpentine or separate into smaller circles depending on what the group decides. The dance also provides an aesthetic space for its participants to voice frustration, anger, and tension through song with the hopes of producing reprieve from gendered hardships. Chun has decided to withhold these songs from her audience; we never hear the women singing. Through this erasure, Chun embeds the consonant sound with affective force whereby a politics of language and gendered presence is enunciated through and beyond a form of silencing. The dance redirects trajectories of dominance whereby the shushing takes on a new voice imbued with agency and hope. Because of how Chun isolates and amplifies its sound, ssshhhh is free to take on different meanings and associations. I was reminded of rushing water or gusts of wind, or the sound used to lull my child to sleep. I was brought into another index of knowing and relating.
The sounds of language hold erasures and layered histories often obfuscated by our mundane encounters with them. Largely understood as the most sonorous part of the syllable, vowels produce the loudest speech sounds, and their capacity for holding larger amplitudes or louder volumes has been linked to the sonic expression of emotion. Consonant sounds are more pragmatic than vowels. They are known for their functionality, for the ways in which they assemble the semantic structure of words, and for their capacities to hold vowels in place or, as Anne Carson describes it, for “delineating meaning amid the flow of open vowel sounds.” Consonant and vowel sounds map out different functional trajectories by virtue of the shape of mouth and orientation of breath that these sounds demand. Like Chun, I’m interested in what political orientations become possible when we source emotion elsewhere, beyond the confines of spoken words imposed upon us.
The word consonant is a noun, a word used to identify or classify, a semantic enclosure that establishes a subject or object. But unlike the word vowel, consonant is also an adjective. A consonant possesses the capacity to describe, to name, to tell us more. Adjectives parcel out description on states of being; in this way, they are inherently phenomenological. In 술래 SULLAE, Chun empties vowels of their sonic substance, leaving behind traces of fragmented characters and differently shaped circles in their wake. They are stripped of breath and their symbolic value, forming a new method of communication that reroutes expectations of what language, as we know it, can do and sound like. Like ganggang sullae, the vowel is premised on the shape of a circle, but in 술래 SULLAE, Chun provokes us to think about what becomes possible beyond the circular structuring device, what becomes possible beyond the purview of the violent embeddedness of English and its colonial exigencies.
Chun has noted that the moon that hovers above the ganggang sullae is yet another site of imperial conquest. In Art Forum, Chun states: “when I look up at it to feel comforted or to find solace, I’m reminded of colonial violence and an agenda that’s projected onto it. In that way, the moon also reflected how I see language.” Chun’s turn to consonants signals a reshaping of the colonial frame that does not disavow or idealize the legacies of imperialism on systems of communication, but instead highlights the tensions and obstructions produced in its shadows.
Featured image: 술래 SULLAE, 2020, single-channel version, courtesy of artist
Casey Mecija is an Assistant Professor in the Department of Communication & Media Studies at York University. Her current research examines sound as a mode of affective, psychic and social representation, specifically in relation to diasporic experience. Drawing on sound studies, queer diaspora studies and Filipinx Studies, her research considers how sensorial encounters are enmeshed and disciplined by social and psychic conditions. She is also a musician and filmmaker, whose work has received a number of accolades and has been presented internationally.