Beyond the Every Day: Vocal Potential in AI Mediated Communication


In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneerworks in Brooklyn, NY. Diedrick, 2021 Mozilla Creative Media award recipient and creator of such works as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series for Sounding Out! (along with ed-in-chief JS!). It starts today, with Amina Abbas-Nazari, helping us to understand how Speech AI systems operate from a very limiting set of assumptions about the human voice– are we training it, or is it actually training us?
—
“Hi, good morning. I’m calling in from Bangalore, India.” I’m talking on speakerphone to a man with an obvious Indian accent. He pauses. “Now I have enabled the accent translation,” he says. It’s the same person, but he sounds completely different: loud and slightly nasal, impossible to distinguish from the accents of my friends in Brooklyn.
The AI startup erasing call center worker accents: is it fighting bias – or perpetuating it? (Wilfred Chan, 24 August 2022)
This telephone interaction was recounted in The Guardian’s reporting on a Silicon Valley tech start-up called Sanas. The company provides AI enabled technology that modifies call centre workers’ voices in real time to sound more “Western”. The company describes this venture as a solution to improve communication between typically American callers and call centre workers, who might be based in countries such as the Philippines and India. Meanwhile, research has found that major companies’ AI interactive speech systems exhibit considerable racial imbalance when trying to recognise Black voices compared to white voices. As a result, in the hopes of being better heard and understood, Google smart speaker users with regional or ethnic American accents relay that they find themselves contorting their mouths to imitate Midwestern American accents.
These instances describe racial biases present in voice interactions with AI enabled and mediated communication systems, whereby sounding ‘Western’ entitles one to more efficient communication, better usability, or increased access to services. This is not a problem specific to AI though. Linguistics researcher John Baugh, writing in 2002, describes how linguistic profiling is known to have resulted in housing being denied to people of colour in the US via telephone interactions. Jennifer Stoever’s The Sonic Color Line (2016) presents a cultural and political history of the racialized body and how it both informed and was informed by emergent sound technologies. AI mediated communication repeats and reinforces biases that pre-exist the technology itself, and also helps them become even more widely pervasive.

Mozilla’s commendable Common Voice project aims to ‘teach machines how real people speak’ by building an open source, multi-language dataset of voices to improve usability for non-Western speaking or sounding voices. But singer and musicologist Nina Sun Eidsheim describes how ‘a specific voice’s sonic potentiality [in] its execution can exceed imagination’ (7), and voices as having ‘an infinity of unrealised manifestations’ (8) in The Race of Sound (2019). Eidsheim’s sentiments describe a vocal potential, through musicality, that exists beyond ideas of accents and dialects, and vocal markers of categorised identity. As a practicing vocal performer, I recognise and resonate with Eidsheim’s ideas. I have a particular interest in extended and experimental vocality, especially gained through my time singing with Musarc Choir and working with artist Fani Parali. In these instances, I have experienced the pleasurable challenge of being asked to vocalise the mythical, animal, imagined, alien and otherworldly edges of the sonic sphere, to explore complex relations between bodies, ecologies, space and time, illuminated through vocal expression.

Following from Eidsheim, and through my own vocal practice, I believe AI’s prerequisite of voices as “fixed, extractable, and measurable ‘sound object[s]’ located within the body” is over-simplistic and reductive. Voices, within systems of AI, are made to seem only as computable delineations of person, personality and identity, constrained to standardised stereotypes. By highlighting vocal potential, I offer a unique critique of the way voices are currently comprehended in AI recognition systems. When we appreciate the voice beyond the homogenous, we give it authority and autonomy, ultimately leading to a fuller understanding of the voice and its sounding capabilities.
My current PhD research, Speculative Voicing, applies thinking about the voice from a musical perspective to the sound and sounding of voices in artificially intelligent conversational systems. Hereby the voice becomes an instrument of the body to explore its sonic materiality, vocal potential and extremities of expression, rather than being comprehended in conjunction with vocal markers of identity aligning to categories of race, gender, age, etc. In turn, this opens space for the voice to be understood as a shapeshifting, morphing and malleable entity, with immense sounding potential beyond what might be considered ordinary or everyday speech. Over the long term this provides discussion of how experimenting with vocal potential may illuminate more diverse perspectives about our sense of self and being in relation to vocal sounding.
Vocal and movement artist Elaine Mitchener exhibits the disillusion of the voice as ‘fixed’ perfectly in her performance of Christian Marclay’s No!, which I attended one hot summer’s evening at the London Contemporary Music Festival in 2022. Marclay’s graphic score uses cut outs from comic book strips to direct the performer to vocalise a myriad of ‘No’s.

Mitchener’s rendering of the piece involved the cooperation and coordination of her entire body, carefully crafting lips, teeth, tongue, muscles and ligaments to construct each iteration of ‘No.’ Each transmutation of Mitchener’s ‘No’s’ came with a distinct meaning, context, and significance, contained within the vocalisation of this one simple syllable. Every utterance explored a new vocal potential, enabled by her body alone. In the context of AI mediated communication, we can see that this way of working with the voice renders the idea of the voice as ‘fixed’ redundant. Mitchener’s vocal potential demonstrates that voices can and do exist beyond AI’s prescribed comprehension of vocal sounding.
In order to further understand how AI transcribes understandings of voice onto notions of identity, and vocal potential, I produced the practice project Polyphonic Embodiment(s) as part of my PhD research, in collaboration with Nestor Pestana, with AI development by Sitraka Rakotoniaina. The AI we created for this project is based upon a speech-to-face recognition AI that aims to be able to tell what your face looks like from the sound of your voice. The prospective impact of this AI is deeply unsettling, as its intended applications are wide-ranging, from entertainment to security, and, as previously described, AI recognition systems are inherently biased.

This multi-modal form of comprehending voice is also a hot topic of research being conducted by major research institutions including Oxford University and Massachusetts Institute of Technology. We wanted to explore this AI recognition programme in conjunction with an understanding of vocal potential and the voice as a sonic material shaped by the body. As the project title suggests, the work invites people to consider the multi-dimensional nature of voice and vocal identity from an embodied standpoint. Additionally, it calls for contemplation of the relationships between voice and identity, and individuals having multiple or evolving versions of identity. The collaboration with the custom-made AI software creates a feedback loop to reflect on how people’s vocal sounding is “seen” by AI, to contest the way voices are currently heard, comprehended and utilised by AI, and indeed the AI industry.
The video documentation for this project shows ‘facial’ images produced by the voice-to-face recognition AI, when activated by my voice, modified with simple DIY voice devices. Each new voice variation, created by each device, produces a different outputted face image. Some images perhaps resemble my face? (e.g. Device #8) some might be considered more masculine? (e.g. Device #10) and some are just disconcerting (e.g. Device #4). The speculative nature of Polyphonic Embodiment(s) is not to suggest that people should modify their voices in interaction with AI communication systems. Rather the simple devices work with bodily architecture and exaggerate its materiality, considering it as a flexible instrument to explore vocal potential. In turn this sheds light on the normative assumptions contained within AI’s readings of voice and its relationships to facial image and identity construction.
Through this artistic, practice-led research I hope to evolve and augment discussion around how the sounding of voices is comprehended by different disciplines of research. Taking a standpoint from music and design practice, I believe this can contest ways of working in the realms of AI mediated communication and shape the ways we understand notions of (vocal) identity: as complex, fluid, malleable, and ultimately not reducible to Western logics of sounding.
—
Featured Image: Still image from Polyphonic Embodiments, courtesy of author.
—
Amina Abbas-Nazari is a practicing speculative designer, researcher, and vocal performer. Amina has researched the voice in conjunction with emerging technology, through practice, since 2008 and is now completing a PhD in the School of Communication at the Royal College of Art, focusing on the sound and sounding of voices in artificially intelligent conversational systems. She has presented her work at the London Design Festival, Design Museum, Barbican Centre, V&A, Milan Furniture Fair, Venice Architecture Biennial, Critical Media Lab, Switzerland, Litost Gallery, Prague and Harvard University, America. She has performed internationally with choirs and regularly collaborates with artists as an experimental vocalist.
—

REWIND! . . .If you liked this post, you may also dig:
What is a Voice?–Alexis Deighton MacIntyre
Voice as Ecology: Voice Donation, Materiality, Identity–-Steph Ceraso
Mr. and Mrs. Talking Machine: The Euphonia, the Phonograph, and the Gendering of Nineteenth Century Mechanical Speech – J. Martin Vest
One Scream is All it Takes: Voice Activated Personal Safety, Audio Surveillance, and Gender Violence—María Edurne Zuazu
Echo and the Chorus of Female Machines—AO Roberts
On Sound and Pleasure: Meditations on the Human Voice– Yvon Bonenfant
Tuning In to the Desi Valley: Getting to Know a Community via Radio

Sound has a peculiar relationship to mindfulness; zoning in and out, active and passive forms of listening while we situate our listening practices alongside other daily activities. Especially when it comes to driving, listening to something or someone or just singing aloud by myself, I have realized, helps me drown out other noises of alertness. Over the years I have come to value background music or chatter and especially radio programming that takes the burden of curation and scheduling off my back, in all sorts of tasks that require deep concentration. Enough and more has been said about the visual-bias in various forms of ethnographic inquiry (see Andrew C. Sparkes’s “Ethnography and the senses” for a good example). Without belaboring these arguments, I also find that knowing through listening and listening as a mode of non-haptic yet immersive and intimate engagement can also prove to be a fruitful method of inquiry, especially in our post-pandemic worlds, where it feels a lot harder to establish intimacy. The United Nations noted that radio, in particular, “provided solace” during that period of physical distancing and social isolation.
For me, radio sparked my accidental realization and foregrounding of sonic methods as an itinerant means of getting to know new things, people and surroundings in life and research when I moved from New York to the San Francisco Bay Area in mid-2022 to start a new position as a postdoctoral researcher. Knowing that I would continue living in California for the near future, after eight long years of having deferred driving in America, I decided to learn to drive and buy a car. I was also especially excited to be moving to Sunnyvale, a city in the Southern Peninsula, located between better-known places like Palo Alto and San Jose. Sunnyvale is often jokingly called the desi capital, perhaps the most Indian of any ‘Little India’ you could find in America. As shorthand, desi, a Hindi word, refers to anyone and everything with ties to the South Asian subcontinent. In more recent years, the term has gained currency especially among South Asian diasporic communities to self-refer to culture, music, food, often to signify the presence and strength of transnational ties (between India and their countries of settlement).
What I hadn’t anticipated was that driving would bring about a new connection to the medium of radio, as I started tuning in to take my mind off the jitters that come with the new sounds of an automobile-dependent region: fast moving rubber hitting the freeway tarmac and the lane changes and the scalar adjustments that demand driving tens of miles to ensure you still have a social life stitched together across the vastness of California. I began to find the act of tuning in and out of stations fascinating, especially how the radio as a device holds the parallel realities of so many people with different interests, languages and politics together but separate. Scholarship and even casual listening show that non-English radio stations catering to various immigrant and other communities have existed for a long time in the US and elsewhere (I wish pondering the sonic geographies of Radio Garden, a web-based map interface that allows listeners to access any free-to-air radio stations across the world, wasn’t beyond the scope of this post!). Both as an ethnographer and as an insider-outsider within the larger Indian immigrant and diaspora community (but a newcomer to the Bay Area), tuning into the local desi radio station while driving offered me a good way to enter the desi community of the Bay Area, to know what it means to be desi and perform Indianness in 2022, in a place where I regularly see so many Indians and South Asians every single day.
As a friend who recently visited me in Sunnyvale remarked as we waited for our table at the always-busy Madras Cafe, Sunnyvale feels so much like India! And I felt it too; what she was referring to was not only the very visible presence of Indian people all around town or the abundance of restaurants catering to various sub-regional cuisines from India. What felt different to me here was how relaxed everyone looked in grocery stores, restaurants and elsewhere; Sunnyvale is so utterly remade as a pan-Asian but mostly Indian space that even the smallest performances of fitting in feel unnecessary. In fact, shopping here reminded me of a point Purnima Mankekar’s Indian narrators made in her iconic essay on Indian grocery shopping in the San Francisco Bay Area (2010), that Indian stores in Sunnyvale and Milpitas are places where “white people look out of place” as compared to the ones in Berkeley.
The juxtaposition of the non-performances and the weight of being comfortably in place in a community like Sunnyvale only settles on the mind and body very slowly, like a faint but familiar smell from home. Words and demands often stumble out of my mouth at the grocery store, as if I am allowed and I will be completely understood. Immigrant life in America is steeped in language acrobatics, balancing being understood with becoming deliberately opaquely incomprehensible, using one accent for the ones from home, one for those who make you feel at home, and then the American accent for the Americans outside. The contrast of places like Sunnyvale, Fremont, Milpitas and other similar Northern California cities that have been transformed by immigrant presence is not just one of tangible and observable things, but sonic markers too, like Bolly 92.3FM.

Bolly 92.3FM is the default desi radio station that services the entire San Francisco Bay Area. As the name suggests, for the most part the station plays popular Bollywood songs from recently released movies and albums, but just as other stations do in India, Bolly92.3FM leverages its listener demographics at different times of the day to also play classics and hits from older Hindi films during late night slots. Interestingly, as the presenters repeat the station name and jingle time and again, they also remind you that you are listening to Bay Area’s Bollywood station owned by the Silicon Valley Asian media network. The name Bolly92.3FM is a play on the broad familiarity with Bollywood in the US even as Indian audiences globally have moved away from the older connotations of Bollywood as North Indian cinema with song, dance and people dressed in flashy clothes. Much like the hyper-authentic Indian restaurants that serve regional cuisines such as Andhra, Tamil, Gujarati, Rajasthani and Marathi food to their loyal and affording immigrant patrons, Bolly92.3FM has also configured its programming to cater to different regional and linguistic communities from India in the Bay Area. For instance, Saturday morning and afternoon slots are dedicated to Telugu programs (everything is in Telugu, from the hosts’ discussions to the songs being played and the actual topics being discussed), as if the station turns into a different station, with the implicit acknowledgement of the substantial cultural presence and possibly the sonic and financial power of Telugu listeners among the wider Indian community here. There are similar slots dedicated to Gujarati language programming.
In addition to language and topical interests, listening to Bolly92.3FM has been instructive in getting a feel for communal desires, aspirations and anxieties through its advertisements. There are fellow desi real estate agents, tax planners, dentists, travel agencies and coaching centers, each signaling why they are trustworthy. Some remind their audiences of their shared cultural background as key to them being able to understand their customers’ needs; others also indicate their familiarity with America: one realtor is the only Indian-origin person to feature in a top realtor list, another tax planner’s family has been in the US for three generations and thus he is well aware of the nitty-gritties of transnational estate planning. A known Indian-origin realtor in the area even sponsors his own radio show on the weekends where he takes questions from prospective home buyers and sellers. Before and beyond giving them financial advice, he often explains how fellow Indian immigrants think about financial opportunities, investments, how they might seek social validation from fellow Indians who might see their home and so on. He weaves in such exposition before talking dry financial facts about mortgages. The same tax planner and his co-host occasionally offer historical accounts of changing real estate trends, how certain places used to be affordable for Indian immigrant buyers and how new places are becoming of interest as vacation homes for more affluent Indian immigrants.

Hearing the tax planners and real estate agents plot these dynamic and speculative maps of South Asian financial, cultural and political futures week after week felt like witnessing what many historical texts on migration within the Bay Area have described as waves in the past. As I mentioned earlier, Sunnyvale is not the only ‘Little India’ in the Bay Area, let alone in California, but rather is a more recent iconic place in the Indian and South Asian diaspora map where the younger and newer immigrants are finding homes. Fremont, Milpitas, and Hayward in the East Bay closer to Oakland, and San Jose in the South Bay, saw similar waves of Indian immigrant settlers in the past, many of whom are now far more affluent than their younger counterparts. In my short time since moving here, I learned, both from conversations with friends who grew up in the area and from communication scholar Anne Marie Todd’s work on the past and present of Santa Clara Valley, that this region has not only seen waves of migration from settlers across the world but that, with each incoming wave and turn in occupational trends from farming to railroads to IT work, multiple communities have remade cities in the Bay Area over the decades.
I also find advertisements as well as talk shows interesting because they offer a more proximate and concentrated triangulation of otherwise scattered, overheard communal talk. I’ve heard things about Indian immigrants and real estate, and their aspirations for their children to get into Ivy League colleges. I have also been asked at the grocery store if I know people looking to get married; I’ve overheard aunties at the grocery store consulting each other about dealing with death, financial loss, planetary alignments and more. But it is through the radio that these private interests, anxieties and futures take more collective and articulate forms.
To speak, to be heard, and to be understood as intended are as important as visual representation to allow for feeling in place in the world and for feeling at home in the United States. Very simply put, in the American context, while Black and Brown vocal expression and volume or the stereotype of loudness have been historically stigmatized as a part of the larger racist depiction of ‘unruly’ bodies, various forms of Asian speech, languages and expression, including loudness but also silence and the absence of vocality, have also been racialized against the backdrop of white socio-linguistic normativity. Specifically, Indian and South Asian immigrants have been repeatedly represented in popular American culture as muted characters whose interiority is either irrelevant to the plot or cannot be accessed. Think of Raj Koothrappali from the show The Big Bang Theory, an accomplished scientist at the prestigious Caltech university who loses his voice around women. In accounts of Indian tech workers in the US from the Y2K era, one finds multiple articles casting them as the back-office worker: good at laborious and boring work but not presentation material. Some of these stereotypes have changed and splintered as more South Asian technologists have gone on to become successful executives and leaders in US companies, but vocality as a form of publicity can be crucial to sonic forms of belonging.

When I arrived in the Bay Area, I was still trying to figure out how to form a community and how to immerse myself in the desi community here, somewhat selfishly to get a glimpse of how Indian presence is remaking the Valley, not only through IT work but also through cultural and political performances. But I also wanted to get to know people, and to be known by people around me. Olakhita means “people known to us” in Gujarati, my mother tongue; in Hindi jaan-pehchaan means to be in each other’s knowing that gives you some claim and affiliation over others without deep intimacy, not quite acquaintances or neighbors like in the American context. Apart from the numerous WhatsApp and Facebook groups aimed at desi folks living in the same neighborhood or city, Bolly92.3FM acts as a beacon for Indians and South Asians spread across San Francisco, the East Bay and the South Bay areas, offering a medium for the various Indian associations and event organizers to reach out to thousands and invite them to Diwali, Dussehra and other festival celebrations as well as any major concerts by visiting Indian artists in the area.

This is far from the first or only time that a radio station has facilitated the forging of affective ties and social and material connections in diaspora. Rather this post recounts how active and passive listening to and through the station revealed over time how much South Asian presence has transformed the Bay Area. I attended the massive Diwali and Dussehra events advertised on 92.3, and once there, I could recognize so many of the organizers’ and sponsors’ names; some of them were the same tax planners and realtors that also run ads and sponsored segments on the station. One week in late September, an ad played on the radio, announcing that a famous hotel in downtown San Jose was now able to accommodate more than a thousand guests and offer a special entryway for the baraat procession (when the groom’s party of a few hundred comes dancing up to the wedding venue). The ad ended with the contact details of a South Asian representative of the hotel who could handle all queries related to Indian wedding arrangements! Bolly92.3FM mediates and shapes the collective desi identity through its programming and advertising, in turn also stitching, materializing and rendering visible a map of the Indian community spread across dozens of non-contiguous cities and neighborhoods in the Bay Area.

It bears noting, however, that the idea of Indianness or desi-ness (an imagined brotherhood among all the expats here) advanced through the radio programs, advertisements as well as the cultural celebrations, is very much a nationalistic and Hindu-dominant one. Although India is an ethnically, linguistically and culturally diverse country and not all people of Indian origin in the Bay Area are Hindu, the radio announcers never really celebrate or discuss Eid, Christmas, or Thanksgiving as events of possible interest. Just based on what is said and what is left unsaid, the dominant self-imagination of the Bay Area desi community as advanced through the radio station feels like it quietly aligns with the dominant religious and political imagination of India as Hindu, middle-class, post-religious and post-caste.
In the process of seeing this map render in my own imagination through regular listening, I also realized how this form of distant listening replicated the mode of jaan-pehchaan (getting to know) for the itinerant immigrant-ethnographer. There is always the pre-fieldwork moment, the slightly promiscuous exploratory moment of getting to know and immersing oneself before one can articulate stakes and research questions. It is also often a period of deep uncertainty and ambivalence, since, as ethnographers, responsibility for and towards our field, communities, and interlocutors is always at the center of every project. We are always reminded not to be extractive and to think deeply about power relations even as we attempt to forge meaningful ties with those whom we want to observe and learn from and write about. Much like other visa-workers and international scholars in US academia, being a non-citizen ethnographer engenders multiple kinds of precarities: there is no straightforward or replicable guidebook on how to establish rapport, gain access, build trust and more with a community. More importantly, the itinerant-immigrant ethnographer’s relationships are also always interrupted, prone to arbitrary border restrictions and chronic deracination.
In the early days of digital ethnography, Kate Crawford powerfully argued to reframe acts of lurking (silently and passively hanging out in online communities) as forms of listening and by extension, listening as a concomitant and constitutive practice when we consider participation as speaking or having a voice. In the case of diasporic radio, as I realized, not only is the act of listening quite literal but it also affords and reinforces the vitality of different modes of agentic power and participation, those marked by ambivalence, yet-to-be gained legitimacy; forms of minor participation if you will. Via Crawford, listening and/as lurking also emerges as specifically racially inflected modes of agentic participation against the backdrop of media policy and the emphasis on free expression and speech as the ultimate realization of democratic power.
Among diasporic communities and further among itinerant immigrants within those communities, listening, overhearing and eavesdropping become the de facto modes of democratic and communitarian participation. Listening to the radio as a way of immersion is not a solution to these enduring dilemmas of ethical ethnography but to borrow from the analogy of Californian driving, listening to the radio, just like other forms of digital and analog lurking, allowed me an ‘on-ramp’ to gradually merge and embed myself in the larger South Asian diasporic community.
—
Featured Image: “Driving” by Flickr User AnnaNakami (CC BY-NC 2.0)
—
Noopur Raval is an interdisciplinary researcher interested in understanding global futures of work, the life and work decisions made by immigrant workers in tech companies, changing values and moral norms and projects of personhood especially in postcolonial settings. Noopur received a PhD in Informatics from University of California Irvine (UCI) in September 2020 and, through July 2022 was a postdoc researcher at the AI Now Institute at New York University. Noopur is currently a postdoctoral researcher at UC Santa Cruz – Silicon Valley Extension in the Computational Media department, working with Dr Norman Su. In Fall 2023, Noopur will join the Department of Information Studies at UCLA as an assistant professor.
—

REWIND! . . .If you liked this post, you may also dig:
“Gendered Soundscapes of India, an Introduction”–Monika Mehta and Praseeda Gopinath
The Queer Sound of the Dandiya Queen, Falguni Pathak–-Pavitra Sundar
“Out of Sync: Gendered Location Sound Work in Bollywood”—Priya Jaikumar
SO! LA: Sounding the California Story–Bridget Hoida
“Vous Ecoutez La Voix du Peuple”: The Kreyol Language Pirate Radio Stations of Flatbush, Brooklyn–David Goren
Listening (Loudly) to Spanish-language Radio–Dolores Inés Casillas