
Beyond the Every Day: Vocal Potential in AI Mediated Communication 

In summer 2021, sound artist, engineer, musician, and educator Johann Diedrick convened a panel at the intersection of racial bias, listening, and AI technology at Pioneer Works in Brooklyn, NY. Diedrick, a 2021 Mozilla Creative Media award recipient and creator of works such as Dark Matters, is currently working on identifying the origins of racial bias in voice interface systems. Dark Matters, according to Squeaky Wheel, “exposes the absence of Black speech in the datasets used to train voice interface systems in consumer artificial intelligence products such as Alexa and Siri. Utilizing 3D modeling, sound, and storytelling, the project challenges our communities to grapple with racism and inequity through speech and the spoken word, and how AI systems underserve Black communities.” And now, he’s working with SO! as guest editor for this series (along with ed-in-chief JS!). It starts today, with Amina Abbas-Nazari helping us to understand how speech AI systems operate from a very limiting set of assumptions about the human voice: are we training it, or is it actually training us?


“Hi, good morning. I’m calling in from Bangalore, India.” I’m talking on speakerphone to a man with an obvious Indian accent. He pauses. “Now I have enabled the accent translation,” he says. It’s the same person, but he sounds completely different: loud and slightly nasal, impossible to distinguish from the accents of my friends in Brooklyn.

The AI startup erasing call center worker accents: is it fighting bias – or perpetuating it? (Wilfred Chan, 24 August 2022)

This telephone interaction was recounted in The Guardian’s reporting on a Silicon Valley tech start-up called Sanas. The company provides AI-enabled technology for real-time voice modification, making call centre workers’ voices sound more “Western”. The company describes this venture as a solution to improve communication between typically American callers and call centre workers, who might be based in countries such as the Philippines and India. Meanwhile, research has found that major companies’ AI interactive speech systems exhibit considerable racial imbalance when trying to recognise Black voices compared to white speakers. As a result, in the hopes of being better heard and understood, Google smart speaker users with regional or ethnic American accents relay that they find themselves contorting their mouths to imitate Midwestern American accents.

These instances describe racial biases present in voice interactions with AI-enabled and -mediated communication systems, whereby sounding ‘Western’ entitles one to more efficient communication, better usability, or increased access to services. This is not a problem specific to AI, though. Linguistics researcher John Baugh, writing in 2002, describes how linguistic profiling is known to have resulted in housing being denied to people of colour in the US via telephone interactions. Jennifer Stoever‘s The Sonic Color Line (2016) presents a cultural and political history of the racialized body and how it both informed and was informed by emergent sound technologies. AI mediated communication not only repeats and reinforces biases that pre-exist the technology itself, but also helps them become even more pervasive.

“pain” by Flickr user Pol Neiman (CC BY-NC-ND 2.0)

Mozilla’s commendable Common Voice project aims to ‘teach machines how real people speak’ by building an open source, multi-language dataset of voices to improve usability for non-Western speaking or sounding voices. But singer and musicologist Nina Sun Eidsheim, in The Race of Sound (2019), describes how ’a specific voice’s sonic potentiality [in] its execution can exceed imagination’ (7), and voices as having ‘an infinity of unrealised manifestations’ (8). Eidsheim’s sentiments describe a vocal potential, through musicality, that exists beyond ideas of accents and dialects, and vocal markers of categorised identity. As a practicing vocal performer, I recognise and resonate with Eidsheim’s ideas. I have a particular interest in extended and experimental vocality, especially gained through my time singing with Musarc Choir and working with artist Fani Parali. In these instances, I have experienced the pleasurable challenge of being asked to vocalise the mythical, animal, imagined, alien and otherworldly edges of the sonic sphere, to explore complex relations between bodies, ecologies, space and time, illuminated through vocal expression.

Joy by Flickr user François Karm, cropped by SO! (CC BY-NC 2.0)

Following from Eidsheim, and through my own vocal practice, I believe AI’s prerequisite of voices as “fixed, extractable, and measurable ‘sound object[s]’ located within the body” is over-simplistic and reductive. Voices, within systems of AI, are made to seem only as computable delineations of person, personality and identity, constrained to standardised stereotypes. By highlighting vocal potential, I offer a unique critique of the way voices are currently comprehended in AI recognition systems. When we appreciate the voice beyond the homogenous, we give it authority and autonomy, ultimately leading to a fuller understanding of the voice and its sounding capabilities.

My current PhD research, Speculative Voicing, applies thinking about the voice from a musical perspective to the sound and sounding of voices in artificially intelligent conversational systems. Here the voice becomes an instrument of the body to explore its sonic materiality, vocal potential and extremities of expression, rather than being comprehended in conjunction with vocal markers of identity aligning to categories of race, gender, age, etc. In turn, this opens space for the voice to be understood as a shapeshifting, morphing and malleable entity, with immense sounding potential beyond what might be considered ordinary or everyday speech. Over the long term, this opens up discussion of how experimenting with vocal potential may illuminate more diverse perspectives about our sense of self and being in relation to vocal sounding.

Vocal and movement artist Elaine Mitchener exhibits the dissolution of the voice as ‘fixed’ perfectly in her performance of Christian Marclay’s No!, which I attended one hot summer’s evening at the London Contemporary Music Festival in 2022. Marclay’s graphic score uses cut-outs from comic book strips to direct the performer to vocalise a myriad of ‘No’s.

In connection with Fraenkel Gallery’s 2021 exhibition, experimental vocalist Elaine Mitchener performs Christian Marclay’s graphic score, “No!” Image by author.

Mitchener’s rendering of the piece involved the cooperation and coordination of her entire body, carefully crafting lips, teeth, tongue, muscles and ligaments to construct each iteration of ‘No.’ Each transmutation of Mitchener’s ‘No’s’ came with a distinct meaning, context, and significance, contained within the vocalisation of this one simple syllable. Every utterance explored a new vocal potential, enabled by her body alone. In the context of AI mediated communication, we can see this way of working with the voice renders the idea of the voice as ‘fixed’ as redundant. Mitchener’s vocal potential demonstrates that voices can and do exist beyond AI’s prescribed comprehension of vocal sounding.

In order to further understand how AI transcribes understandings of voice onto notions of identity, and vocal potential, I produced the practice project Polyphonic Embodiment(s) as part of my PhD research, in collaboration with Nestor Pestana, with AI development by Sitraka Rakotoniaina. The AI we created for this project is based upon a speech-to-face recognition AI that aims to tell what your face looks like from the sound of your voice. The prospective impact of this AI is deeply unsettling: its intended applications are wide-ranging – from entertainment to security – and, as previously described, AI recognition systems are inherently biased.

Still from project video for Polyphonic Embodiment(s). Image by author.

This multi-modal form of comprehending voice is also a hot topic of research being conducted by major research institutions including Oxford University and Massachusetts Institute of Technology. We wanted to explore this AI recognition programme in conjunction with an understanding of vocal potential and the voice as a sonic material shaped by the body. As the project title suggests, the work invites people to consider the multi-dimensional nature of voice and vocal identity from an embodied standpoint. Additionally, it calls for contemplation of the relationships between voice and identity, and individuals having multiple or evolving versions of identity. The collaboration with the custom-made AI software creates a feedback loop to reflect on how peoples’ vocal sounding is “seen” by AI, to contest the way voices are currently heard, comprehended and utilised by AI, and indeed the AI industry.

The video documentation for this project shows ‘facial’ images produced by the voice-to-face recognition AI when activated by my voice, modified with simple DIY voice devices. Each new voice variation, created by each device, produces a different outputted face image. Some images perhaps resemble my face (e.g. Device #8), some might be considered more masculine (e.g. Device #10), and some are just disconcerting (e.g. Device #4). The speculative nature of Polyphonic Embodiment(s) is not to suggest that people should modify their voices in interaction with AI communication systems. Rather, the simple devices work with bodily architecture and exaggerate its materiality, considering it as a flexible instrument to explore vocal potential. In turn this sheds light on the normative assumptions contained within AI’s readings of voice and its relationships to facial image and identity construction.

Through this artistic, practice-led research I hope to evolve and augment discussion around how the sounding of voices is comprehended by different disciplines of research. Taking a standpoint from music and design practice, I believe this can contest ways of working in the realms of AI mediated communication and shape the ways we understand notions of (vocal) identity: as complex, fluid, malleable, and ultimately not reducible to Western logics of sounding.

Featured Image: Still image from Polyphonic Embodiments, courtesy of author.

— 

Amina Abbas-Nazari is a practicing speculative designer, researcher, and vocal performer. Amina has researched the voice in conjunction with emerging technology, through practice, since 2008 and is now completing a PhD in the School of Communication at the Royal College of Art, focusing on the sound and sounding of voices in artificially intelligent conversational systems. She has presented her work at the London Design Festival, Design Museum, Barbican Centre, V&A, Milan Furniture Fair, Venice Architecture Biennial, Critical Media Lab, Switzerland, Litost Gallery, Prague and Harvard University, America. She has performed internationally with choirs and regularly collaborates with artists as an experimental vocalist.


REWIND! . . .If you liked this post, you may also dig:

What is a Voice?–Alexis Deighton MacIntyre

Mr. and Mrs. Talking Machine: The Euphonia, the Phonograph, and the Gendering of Nineteenth Century Mechanical Speech – J. Martin Vest

One Scream is All it Takes: Voice Activated Personal Safety, Audio Surveillance, and Gender Violence–María Edurne Zuazu

Echo and the Chorus of Female Machines–AO Roberts

On Sound and Pleasure: Meditations on the Human Voice–Yvon Bonenfant

“Caught a Vibe”: TikTok and The Sonic Germ of Viral Success

“When I wake up, I can’t even stay up/I slept through the day, fuck/I’m not getting younger,” laments Willow Smith of The Anxiety on “Meet Me at Our Spot,” a track released through MSFTSMusic and Roc Nation in March of 2020. Despite the song’s nature as a “sludgy alternative track with emo undertones that hits at the zeitgeist,” “Meet Me at Our Spot” received very little attention after its initial release and did not chart until the summer of 2021, when it went viral on TikTok as part of a dance trend. The short-form video app, which exploded in popularity during the COVID-19 pandemic, catalyzed the track’s belated rise to success; it reached no. 21 on the US Billboard Hot 100, becoming Willow’s highest-charting song since her 2010 hit, “Whip My Hair”.

The app currently known as TikTok began as Musical.ly, which was acquired by ByteDance in 2017 and folded into the rebranded TikTok in 2018. By March of 2021, the app boasted one billion worldwide monthly users, indicative of a growth rate of about 180%. This explosion was in many ways catalyzed by successive lockdowns during the first waves of the COVID-19 pandemic. Despite the relaxation and subsequent abandonment of COVID mitigation measures, the app has retained a large volume of its users, remaining one of the highest grossing apps in the iOS environment. TikTok’s viral success (both as noun and adjective) has worked to create a kind of vibe economy in which artists are now subject to producing a particular type of sound in order to be rendered legible to the pop charts.

For anyone who has yet to succumb to the TikTok trap, allow me to offer you a brief summary of how it functions. Upon opening it, you are instantly fed content. Devoid of any obvious internal operating logic, it is the media equivalent of drinking from a fire hose. Immersive and fast-paced, users vertically scroll through videos that take up their entire screen. Within five minutes of swiping, you can–if your algorithm is anything like mine–see: cute pet videos, protests against police brutality, HypeHouse dance trends, thirst traps, contemporary music, therapy tips, attractive men chopping wood, attractive women lifting weights, and anything else you can fathom. Since its shift from Musical.ly, the app has also been a staging ground for popular music hits such as Lil Nas X’s “Old Town Road”, Lizzo’s “Good As Hell”, and, recently, Harry Styles’ “As It Was.”

The app, which is the perfect–if chaotic–fusion of radio and video, is enmeshed in a wider media ecosystem where social networking and platform capitalism converge. As a result, TikTok is changing the music industry in at least three distinct ways:

First, it affects our music consumption habits. After hearing a snippet of a song used for a TikTok, users are more likely to queue it up on their streaming platform of choice for another, more complete listen. Unlike those platforms, where algorithms work to feed a listener more of what they’ve already heard, TikTok feeds a listener new content. As a result, there’s a good chance you’ve never heard the track being used as a sound. Therefore, TikTok works the way that Spotify used to: as a mechanism for discovery.

Second, TikTok is changing the nature of the single. Rather than relying upon a label as the engine behind a song’s success, TikTok disseminates tracks–or sounds as they’re referred to in the app–widely, determining a song’s success or role as a debut within a series of clicks. Particularly during the pandemic, when musicians were unable to tour, TikTok’s relationship to the industry became even more salient. Artists sought new ways to share and promote their music, taking to TikTok to release singles, livestream concerts, and engage with fans. Moreover, Spotify’s increasingly capacious playlist archive began to boast a variety of tracklists with titles such as, “Best TikTok Songs 2019-2022”, “TikTok Songs You Can’t Get Out Of Your Head”, and “TikTok Songs that Are Actually Good” among others. The creation and maintenance of this feedback loop between TikTok and Spotify demonstrates not only the centrality of social media ecosystems as driving current popular music success, but also the way that these technologies work in harmony to promote, sustain, or suppress interest in a particular tune.

Most notoriously, the bridge of Olivia Rodrigo’s “drivers license” went viral as a sound on TikTok in January 2021 and subsequently almost broke the internet. Critics have praised this 24-second section as the highlight of the song, underscoring Rodrigo’s pleading soprano vocals layered over moody, syncopated digital drums. Shortly after it was released, the song shattered Spotify’s record for single-day streams for a non-holiday song. New York Times writer Joe Coscarelli notes of Rodrigo’s success, “TikTok videos led to social media posts, which led to streams, which led to news articles, and back around again, generating an unbeatable feedback loop.”

And third, where songwriting was once oriented towards the creation of a narrative, TikTok’s influence has led artists to a songwriting practice that centers on producing a mood. For The New Yorker, Kyle Chayka argues that vibes are “a rebuke to the truism that people want narratives,” suggesting that the era of the vibe indicates a shift in online culture. He argues that what brings people online is the search for “moments of audiovisual eloquence,” not narrative. Thus, on the one hand, media have become more immersive in order to take us out of our daily preoccupations. On the other, media have taken on a distinct shape so that they can be engaged while doing something else. In other words, media have adapted to an environment wherein the dominant mode of consumption is keyed toward distraction via atmosphere.

“Vibes graffiti, Leake Street,” Image by Flickr user Duncan Cumming (CC BY-NC 2.0)

Despite their relatively recent resurgence in contemporary discourse, vibes have a rich conceptual history in the United States. Once a shorthand for “vibration” endemic to West Coast hippie vernacular, “vibes” have now come to mean almost anything. In his work on machine learning and the novel form, Peli Grietzer theorizes the vibe by drawing on musician Ezra Koenig’s early-aughts blog, “Internet Vibes.” Koenig writes, “A vibe turns out to be something like ‘local colour,’ with a historical dimension. What gives a vibe ‘authenticity’ is its ability to evoke–using a small number of disparate elements–a certain time, place, and milieu, a certain nexus of historic, geographic, and cultural forces.” In his work for Real Life, software engineer Ludwig Yeetgenstein defines the vibe as “something that’s difficult to pin down precisely in words but that’s evoked by a loose collection of ideas, concepts, and things that can be identified by intuition rather than logic.” Where Mitch Therieau argues that the vibe might merely be a vocabulary tic of the present moment, Robin James suggests that vibes are not only here to stay, but have in fact been known by many other names before. Black diasporic cultures, in particular, have long believed sound and its “vibrations had the power to produce new possibilities of social attunement and new modes of living,” as Gayle Wald’s “Soul Vibrations: Black Music and Black Freedom in Sound and Space” attests (674). We might then consider TikTok a key method of dissemination for a maximalist, digital variant of something like Martin Heidegger’s concept of mood (Stimmung), or Karen Tongson’s “remote intimacy.” The vibe is both indeterminate and multiple, a status to be achieved and the mood that produces it; vibes seek to promote and diffuse feelings through time and space.

Much current discourse around vibes insists that they interfere with, or even discourage academic interpretation. While some people are able to experience and identify the vibe—perform a vibe check, if you will—vibes defy traditional forms of academic analysis. As Vanessa Valdés points out, “In a post-Enlightenment world that places emphasis on logic and reason, there exists a demand that everything be explained, be made legible.” That the vibe works with a certain degree of strategic nebulousness might in fact be one of its greatest assets.

“Vibes, Shoreditch” by Flickr User Duncan Cumming, (CC BY-NC 2.0)

Vibes resist tidy classification and can thus be named across a variety of circumstances and conditions. Although we might think of the action of ‘vibing’ as embodied, and the term vibration quite literally refers to the physical properties of sound waves and their travel through various mediums, the vibe through which those actions are produced does not itself have to be material. Sometimes, they name a genre of feeling or energy: cursed vibes or cottagecore vibes. Sometimes, they function as a statement of identification: I vibe with that, or, in the case of 2 Chainz’s 2016 hit, “it’s a vibe.” Sometimes, vibes are exchanged: you can give one, you can catch one, you can check one. So, while things like energy and mood—which are often taken as cognates for vibes—work to imagine, name, and evoke emotions, vibes are instead invitations.

Not only do vibes serve as a prompt for an attempt at articulating experience, they are also invitations to co-presently experience what seems inarticulable. By capturing patterns in media and culture in order to produce a coherent image/sound assemblage, the production of a vibe is predicated upon the ability to draw upon large swathes of visual, aural, and environmental data. Take, for example, the story of Nathan Apodaca, known by his TikTok handle, 420doggface208. After posting a video of himself listening to Fleetwood Mac’s “Dreams” while drinking cranberry juice and riding a longboard, Apodaca went viral, amassing some 30 million views in mere hours. This subsequently sparked a trend in which TikTok users posted videos of themselves doing the same thing, using “Dreams” as the sound. According to Billboard, this sparked the largest-ever streaming week for Fleetwood Mac’s 1977 hit, with over 8.47 million streams. Of his overnight success, Apodaca says, “it’s just a video that everyone felt a vibe with.” To invoke a vibe is thus to make a particular atmosphere more comprehensible to someone else, producing a resonant effect that draws people together.

As both an extension and tool of culture, vibes are produced by and imbricated within broader social, political, and economic matrices. Recorded music has always been confined—for better and worse—to the technologies, formats, and mediums through which it has been produced for commercial sale. On a platform like TikTok, wherein the emphasis is on potentially quirky microsections of songs, artists are invited to key their work towards those parameters in order to maximize commercial success. Nowadays, pop songs are produced with an eye towards their ability to go viral, be remixed, re-released with a feature verse, meme’d, or included in a mashup. As such, when an artist ‘blows up’ on TikTok, it does not necessarily mean that the sound of the song is good (whatever that might mean). Rather, it might instead be the case that a hybrid assemblage of sound, performance, narrative, and image has coalesced successfully into an atmosphere or texture – that we recognize as a/the vibe – something that not only resonates but also sells well. As TikTok’s success continues to proliferate, the app is continually being developed in ways that make it an indispensable part of the popular music industry’s ecosystem. Whether by exposing users to new musical content through the circulation of sounds, or capitalizing upon the speed at which the app moves to brand a song a ‘single’ before it’s even released, TikTok leverages the vibe to get users to listen differently.


We might indeed consider vibes to be conceptual, affective algorithms created in the interstice between lived experience and new media. “Meet Me At Our Spot,” the track through which I’ve framed this article, is full of allusions to youth culture: drunk texts, anxiety over aging, and late-night drives on the 405. It is buoyed by a propulsive bass line that thumps with a restless energy and evokes a mood of escapism. Willow Smith’s intriguing timbre and the pleasing harmonies she achieves with Tyler Cole invite listeners to ride shotgun. For the two minutes and twenty-two seconds of the song, we are immersed within their world. In the final measures the pop of the snare recedes into the background and Tyler’s voice fades away. The vibe of the track – both sonically and thematically – is predicated on the experience of a few, fleeting moments. Willow leaves us with a final provocation, one that resonates with popular music’s current mode: “Caught a vibe, baby are you coming for the ride?”

Featured Image: Screencap of Nathan Apodaca’s viral TikTok post, courtesy of SO! eds.

Jay Jolles is a PhD candidate in American Studies at the College of William and Mary currently at work on a dissertation tentatively titled “Man, Music, and Machine: Audio Culture in a/the Digital Age.” He is an interdisciplinary scholar with interests in a wide range of fields including 20th and 21st century literature and culture, critical theory, comparative media studies, and musicology. Jay’s scholarly work has appeared in or is forthcoming from The Los Angeles Review of Books, U.S. Studies Online, and Comparative American Studies. His essays can be found in Per Contra, The Atticus Review, and Pidgeonholes, among others. Prior to his time at William and Mary, he was an adjunct professor of English at Drexel University and Rutgers University-Camden.

REWIND! . . .If you liked this post, you may also dig:

Listen to yourself!: Spotify, Ancestry DNA, and the Fortunes of Race Science in the Twenty-First Century–Alexander W. Cohen

Evoking the Object: Physicality in the Digital Age of Music–Primus Luta

“Music is not Bread: A Comment on the Economics of Podcasting”–Andreas Duus Pape

“Pushing Record: Labors of Love, and the iTunes Playlist”–Aaron Trammell

Critical bandwidths: hearing #metoo and the construction of a listening public on the web–Milena Droumeva

TiK ToK: Post-Crash Party Pop, Compulsory Presentism and the 2008 Financial Collapse–Dan DiPiero (The other “TikTok”! The people need to know!)
