Editor’s Note: February may be over, but our forum is still on! Today I bring you installment #5 of Sounding Out!’s blog forum on gender and voice. Last week Art Blake talked about how his experience shifting his voice from feminine to masculine as a transgender man intersects with his work on John Cage. Before that, Regina Bradley put the soundtrack of Scandal in conversation with race and gender. The week before I talked about what it meant to have people call me, a woman of color, “loud.” That post was preceded by Christine Ehrick’s selections from her forthcoming book, on the gendered soundscape. We have one more left! Robin James will round out our forum with an analysis of how ideas of what women should sound like have roots in Greek philosophy.
This week Canadian artist and writer AO Roberts takes us into the arena of speech synthesis and makes us wonder about what it means that the voices are so often female. So, lean in, close your eyes, and don’t be afraid of the robots’ voices. –Liana M. Silva, Managing Editor
I used Apple’s SIRI for the first time on an iPhone 4S. After hundreds of miles in a van full of people on a cross-country tour, all of the music had been played and the comedy mp3s entirely depleted. So, like so many first-time SIRI users, we killed time by asking questions that went from the obscure to the absurd. Passive, awaiting command, prone to glitches: there was something both comedic and insidious about SIRI as female-gendered program, something that seemed to bind up the technology with stereotypical ideas of femininity.
Speech synthesis is the artificial simulation of the human voice through hardware or software, and SIRI is but one incarnation of the historical chorus of machines speaking what we code to be female. From the early twentieth-century Voder, through the Cold War-era Silvia and Audrey, up to Amazon’s newly released Echo, researchers have by and large developed these applications as female personae. Each program articulates an individual timbre and character, soothing, soft-spoken, or matter-of-fact; this is your mother, sister, or lover, here to affirm your interests while reminding you about that missed birthday. She is easy to call up in memory, tones rounded at the edges, like Scarlett Johansson’s smoky conviviality as Samantha in Spike Jonze’s Her, a bodiless purr. Simulated speech articulates a series of assumptions about what neutral articulation is, what a female voice is, and whose voice technology can ventriloquize.
The ways computers hear and speak the human voice are as complex as they are rapidly expanding. But in robotics, gender is charted down to actual wavelength, actively policed around 100–150 Hz (male) and 200–250 Hz (female). Now prevalent in entertainment, navigation, law enforcement, surveillance, security, and communications, speech synthesis and recognition hold up an acoustic mirror to the dominant cultures from which they materialize. While they might provide useful tools for everything from time management to self-improvement, they also reinforce cisheteronormative definitions of personhood. Like the binary code that now gives it form, the development of speech recognition separated the entire spectrum of vocal expression into rigid, biologically based categories. Ideas of a real voice vs. a fake voice, in all their resonances with passing or failing one’s gender performance, have through this process been designed into the technology itself.
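The hard-edged pitch policing described above is easy to make concrete. The sketch below (a loose illustration only; the autocorrelation estimator and the exact Hz cutoffs are my own assumptions, not any real product’s code) estimates a signal’s fundamental frequency and then buckets it against fixed bands, discarding everything in between:

```python
import numpy as np

def estimate_f0(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]          # keep non-negative lags only
    lag_min = int(sample_rate / fmax)     # shortest period considered
    lag_max = int(sample_rate / fmin)     # longest period considered
    best_lag = lag_min + np.argmax(corr[lag_min:lag_max])
    return sample_rate / best_lag

def classify_pitch(f0):
    """Reproduce the rigid binary described above: hard Hz cutoffs."""
    if 100 <= f0 <= 150:
        return "male"
    if 200 <= f0 <= 250:
        return "female"
    return "unclassified"   # everything outside the two bands is simply dropped

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220.0 * t)      # a synthetic 220 Hz "voice"
f0 = estimate_f0(tone, sr)
print(round(f0), classify_pitch(f0))      # prints: 219 female
```

Note how the design choice does the gendering: a voice at 175 Hz, squarely in the middle of human speech, simply fails to register.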
A SERIES OF MISERABLE GRUNTS
The first voice to be synthesized was a reed and bellows box invented by Wolfgang Von Kempelen in 1791 and shown off in the courts of the Hapsburg Empire. Von Kempelen had gained renown for his chess-playing Turk, a racist cartoon of an automaton that made waves amongst the nobles until it was revealed that underneath the tabletop was a small man secretly moving the chess player’s limbs. Von Kempelen’s second work, the speaking machine, wowed its audiences thoroughly. The player wheedled and squeezed the contraption, pushing air through its reed larynx to replicate simple words like mama and papa.
Synthesizing the voice has always required some level of making strange, of phonemic abstraction. Bell Laboratories developed the Vocoder, along with its manually played sibling the Voder, and during WWII adapted the technology into a cryptographic device for military communications. The machine split the human voice into a spectral representation, fragmenting the source into a number of different frequency bands that were then recombined into synthetic speech. Noise and unintelligibility shielded the Allies’ phone calls from Nazi interception. Bell Labs engineer Ralph Miller bemoaned the atrocities the machine performed on language, reducing it to a “series of miserable grunts.”
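The band-splitting and recombination described above can be sketched in miniature. This is a toy illustration of the channel-vocoder idea, not Bell Labs’ circuit: the “voice” signal is split into frequency bands, each band’s amplitude envelope is measured, and those envelopes re-excite the same bands of a synthetic carrier. The band count, band edges, and test signals are all arbitrary choices for the demo.

```python
import numpy as np

def channel_vocoder(voice, carrier, sample_rate, n_bands=10, fmax=4000.0):
    """Split `voice` into frequency bands, take each band's amplitude
    envelope, and use it to modulate the same band of a synthetic
    `carrier` -- the basic analysis/resynthesis idea of the vocoder."""
    edges = np.linspace(100.0, fmax, n_bands + 1)
    freqs = np.fft.rfftfreq(len(voice), d=1.0 / sample_rate)
    V, C = np.fft.rfft(voice), np.fft.rfft(carrier)
    out = np.zeros_like(voice)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (freqs >= lo) & (freqs < hi)
        band_v = np.fft.irfft(np.where(mask, V, 0), len(voice))
        band_c = np.fft.irfft(np.where(mask, C, 0), len(voice))
        env = np.abs(band_v)              # crude amplitude envelope, unsmoothed
        out += env * band_c               # re-excite the band with the carrier
    return out

sr = 8000
t = np.arange(sr) / sr
# A wobbling tone stands in for speech; a buzzy square wave is the carrier.
voice = np.sin(2 * np.pi * 220 * t) * (0.5 + 0.5 * np.sin(2 * np.pi * 3 * t))
carrier = np.sign(np.sin(2 * np.pi * 110 * t))
resynth = channel_vocoder(voice, carrier, sr)
```

What comes out is intelligible in shape but robotic in timbre, which is exactly the “miserable grunts” trade-off: the envelopes carry the speech, while everything that made the source voice that voice is thrown away.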
In his history of the Vocoder, How to Wreck a Nice Beach, Dave Tompkins tells how the apparatus originally took up an entire wall and was played solely by female phone operators, yet the pitch of the female voice was said to be too high to be heard by the nascent technology. In fact, when it debuted at the 1939 World’s Fair, only men were chosen to experience the roboticization of their voice. The Voder was originally created to hear only pitches in the range of 100–150 Hz, a designed exclusion from the start. So when the Signal Corps of the Army convinced General Eisenhower to call his wife via Voder from North Africa, Miller and the developers panicked for fear she wouldn’t be heard. Entering the Pentagon late at night, Mamie Eisenhower spoke into the telephone and a fragmented version of her words travelled across the Atlantic. Resurfacing in angular vocoded form, her voice urged her husband to come home, and he had no problem hearing her. Instead of giving the developers pause to question their own definitions of gender, this interaction is told as a derisive footnote in the history of sound and technology: the punchline being that her voice was heard because it was as low as a man’s.
In fall 2014 Amazon launched Echo, their new personal assistant device. Echo is a 12-inch-long plain black cone that stands upright on a tabletop, similar in appearance to a telephoto camera lens. Equipped with far-field mics, Echo has a female voice, connected to the cloud and always on standby. Users engage Echo with their own chosen ‘wake’ word. The linguistic similarity to a BDSM safe word can hardly have been lost on the developers, though here the word is inverted: it engages rather than halts action, awakening an instrument that lies dormant awaiting command.
Amazon’s much-parodied promotional video for Echo is narrated by the innocent voice of the youngest daughter in a happy, straight, white, middle-class family. While the son pitches Oedipal jabs at the father for his dubious role as patriarchal translator of technology, each member of the family soon discovers the ways Echo is useful to them. They name it Alexa and move from questions like “Alexa, how many teaspoons are in a tablespoon?” and “How tall is Mt. Everest?” to commands for dance mixes and cute jokes. Echo enacts a hybrid role as mother, surrogate companion, and nanny of sorts, not through any real aspects of labor but through the intangible contribution of information. As a female-voiced oracle in the early pantheon of the Internet of Things, Echo’s use value is squarely placed in the realm of cisheteronormative domestic knowledge production. Gone are the tongue-in-cheek existential questions proffered to SIRI upon its release. The future with Echo is clean, wholesome, and absolutely SFW. But what does it mean for Echo to be accepted into the home as a female-gendered speaking subject?
Concerns over privacy and surveillance quickly followed Echo’s release, alarms mostly sounding over its “always on” function. Amazon banks on the safety and intimacy we culturally associate with the female voice to ease the transition of robots and AI into the home. If the promotional video painted an accurate picture of Echo’s usage, it would appear that Amazon had successfully launched Echo as a bodiless voice over the uncanny valley, the chasm below littered with broken phalanxes of female machines. Masahiro Mori coined the now familiar term uncanny valley in 1970 to describe the dip in empathic response to humanoid robots as they approach realism.
If we listen to the litany of reactions to robot voices through the filters of gender and sexuality, it reveals the stark inclines of what we might think of as a queer uncanny valley. Paulina Palmer wrote in The Queer Uncanny about recurring tropes in queer film and literature, expanding upon what Freud saw as a prototypical aspect of the uncanny: the doubling and interchanging of the self. In the queer uncanny we see another kind of rift: that between signifier and signified embodied by trans people, the tearing apart of gender from its biological basis. The non-linear algebra of difference posed by queer and trans bodies is akin to the blurring of divisions between human and machine represented by the cyborg. This is the coupling of transphobic and automatonophobic anxieties, defined always in relation to the responses and preoccupations of a white, able-bodied, cisgendered male norm. This is the queer uncanny valley. For the synthesized voice to function here, it must ease the chasm, like Echo: sutured by a voice coded as neutral, but premised upon the imagined body of a white, heterosexual, educated middle-class woman.
My own voice spans a range that would have dismayed someone like Ralph Miller. I sang tenor in junior high choir until I was found out for straying, and then warned to stay properly in the alto, but preferably soprano, range. Around the same time I saw a late-night feature of Audrey Hepburn in My Fair Lady, struggling to lose her crass proletariat inflection. So I, a working-class gender-ambivalent kid, walked around with books on my head muttering “The Rain in Spain Stays Mainly in the Plain” for weeks after. I’m generally loud and opinionated, and people remember me for my laugh. I have sung in doom metal and grindcore punk bands, using both screeching highs and the growling “cookie monster” vocal technique mostly employed by cis males.
Given my own history of toying with and estrangement from what my voice is supposed to sound like, I was interested to try out a new app on the market, the Exceptional Voice App (EVA), touted as “The World’s First and Only Transgender Voice Training App.” Functioning as a speech recognition program, EVA analyzes the pitch, respiration, and character of your voice with the stated goal of providing training to sound more like one’s authentic self. Behind EVA is Kathe Perez, a speech pathologist and businesswoman, the developer and provider of code to the circuit. And behind the code is the promise of giving proper form to rough sounds, pitch-perfect prosody, safety, acceptance, and wholeness. Informational and training videos are integrated with tonal mimicry for phrases like hee, haa, and ooh. User progress is rated and logged with options to share goals reached on Twitter and Facebook. Customers can buy EVA for Gals or EVA for Guys. I purchased the app online for my iPhone for $5.97.
My initial EVA training scores informed me I was 22% female, a recurring number I receive in interfaces with identity recognition software. Facial recognition programs consistently rate my face at 22% female. If I smile I tend to get a higher female response than with my neutral face, coded and read as male. Technology is caught up in these translations of gender: we socialize women to smile more than men, then write code for machines to recognize a woman in a face that smiles.
As for EVA’s usage, it seems to be a helpful pedagogical tool, with more people sharing their positive results and reviews on trans forums every day. With violence against trans people persisting—even increasing—at alarming rates, experienced most acutely by trans women of color, the way one’s voice is heard and perceived is a real issue of safety. Programs like EVA can be employed to increase ease of mobility throughout the world. However, EVA is also out of reach for many, a classed capitalist venture that tautologically defines and creates users with supply. The context for EVA is the system of legal, medical, and scientific categories inherited from Foucault’s era of discipline: the predetermined hallucination of normal sexuality, the invention of biological criteria to define the sexes, and the pathologization of those outside each box, controlled by systems of biopower.
Despite all these tools, we’ll never really know how we sound. It is true that the resonant chamber of our own skull provides us with a different acoustic image of our own voice. We hate to hear our voice recorded because suddenly we catch a sonic glimpse of what other people hear: sharper, more angular tones, higher pitch, less warmth. Speech recognition and synthesis work upon the same logic, the shifting away from interiority, a just-off-the-mark approximation. So the question remains: what would a gender-variant voice synthesis and recognition sound like? How much is reliant upon the technology, and how much depends upon individual listeners, their culture, and what they project upon the voice? As markets grow, more internationally accented English dialects have been added to computer programs with voice synthesis; Thai, Indian, Arabic, and Eastern European English were added to Mac OSX Lion in 2011. Can we hope to soon offer our voices to the industry not as a set of data to be mined into caricatures, but as a way to assist in the opening up of gender definitions? We would be better served to resist the urge to chime in and instead listen to the field in the same way we suddenly hear our recorded voice played back, with a focus on the sour notes of cold translation.
Featured image: “Golden People love Gold Jewelry Robots” by Flickr user epSos.de, CC BY 2.0
AO Roberts is a Canadian intermedia artist and writer based in Oakland whose work explores gender, technology and embodiment through sound, installation and print. A founding member of Winnipeg’s NGTVSPC feminist artist collective, they have shown their work at galleries and festivals internationally. They have also destroyed their vocal cords, played bass and made terrible sounds in a long line of noise projects and grindcore bands, including VOR, Hoover Death, Kursk and Wolbachia. They hold a BFA from the University of Manitoba and an MFA in Sculpture from California College of the Arts.
REWIND!…If you liked this post, you may also dig:
Hearing Queerly: NBC’s “The Voice”—Karen Tongson
On Sound and Pleasure: Meditations on the Human Voice—Yvon Bonenfant
I Been On: BaddieBey and Beyoncé’s Sonic Masculinity—Regina Bradley
Welcome to World Listening Month 2014, our annual forum on listening in observation of World Listening Day on July 18th, 2014. World Listening Day is a time to think about the impacts we have on our auditory environments and, in turn, their effects on us [for the full deets, peep our recent SO! Amplifies post by Eric Leonardson, Executive Director of the World Listening Project]. We kick off our month of thinking critically about listening with a post by media historian Brian Hanrahan, who listens deeply to sonic traces of the past to prompt us to question our desires for contemporary media representations of “reality.” It also marks the global 100-year anniversary of World War I this August 2014: a moment of silence. –J. Stoever, Editor-in-Chief
For some reason that I don’t fully understand, I am very emotionally moved by the space around a sound. I almost think that sometimes I am recording space with a sound in it, rather than sound in a space. -Walter Murch
If you want to listen to the past, there’s never been a time like the present. Every year, it seems, new old recordings are identified, new techniques developed to recover sounds thought irrecoverable. Here is Bismarck’s voice, preserved on a cylinder in 1889. Here, older still, is Edison’s. There is the astonishing recuperation of phonautograms – reverberation traced onto soot-blackened paper in the mid-nineteenth century, digitally processed and played back in our own. But as that processing underlines, no sound recording straightforwardly reproduces the real. An acoustic artifact is a compound of materiality, form and meaning, but also a place where technology meets desire. Old recordings meet the listener’s longing halfway; they invoke a reality always out of reach. And not simply a longing to hear, but also to touch, and be moved by, the fact of an absent existence.
Take, for instance, HMV 09308. In October 1918, just before the end of the Great War, William Gaisberg, a sound recordist of the pre-electric era, took recording equipment to the Western Front in order to capture the sound of British artillery shelling German lines with poison gas. Gaisberg died not long after, probably from Spanish flu, although some say he was weakened by gas exposure during the recording. Nonetheless the “Gas Shell Bombardment” record – a 12-inch HMV shellac disc, just over 2 minutes at 78 rpm – was released a few weeks later, just as the war came to an end. Initially intended to promote War Bonds, ultimately the record was used to raise money for disabled veterans.
For decades, the HMV recording had a reputation as one of the very earliest “actuality” recordings – one documenting a real location and event beyond the performative space of the studio, imprinted with the audible material trace of an actual moment in space and time. Documents like this – no matter what the technology – usually come with additional symbolic authentication. Here, the record’s label does some of that work. This “historic recording,” says the subtitle, is an “actual record taken on the front line.” Publicity pieces drove home the message. In the popular HMV magazine The Voice, Gaisberg – or probably his posthumous ghost-writer – described the expedition in detail, claiming the track to be a “true representation of the bombardment.”
In the same issue, a Major C.J.C. Street compared the recording to his own experience on the Front. “Its realism,” he wrote, “took my breath away… I played the record many times… finding at each attempt some well-remembered detail.” He didn’t say so in his article, but Street – an artillery officer, a novelist and a propaganda man for the intelligence agency MI7 – was in fact the impresario of the record. This was not the first time he had found astute uses for sound media. The previous year he had put together a record that set artillery drill commands to popular tunes – the recording was both a propaganda release and an army training tool for new recruits. With the Gas Shell record, Street knew he wasn’t just selling recorded sound, but also an auratic sense of closeness to an overwhelming reality, the palpable proximity of war and death. Authenticating detail helped to underpin this sense of an absent real made present. Street cued the listener for those “well-remembered details.” In particular, he singled out one indistinct rattly flap-whizz noise, hearing in it, he claimed, the sound of a round with a “loose driving-band.”
The record stayed in the HMV catalog until 1945, but only in the early 1990s were its production history and authenticity claims seriously examined. In specialist journals, archivists, collectors and amateur historians undertook a collective forensic and critical analysis. A promising auditory witness was located: 95-year-old Lt.-Col. Montagu Cleeve – another former artillery officer, in his time a developer of the “Boche Buster” railway gun, later a music professor – was invited to critically assess the recording. Cleeve vouched unreservedly for its authenticity. He heard in it, he said, an unmistakable succession of sounds – the clang of the breech, the gigantic report of the firing explosion, the distinctive whiny whistle of a gas shell on its way across no-man’s-land. Others looked to data rather than the memories of old soldiers. One expert on pre-electric recording noted the angles commanded in firing instructions, correlated them with known muzzle velocities for 4.5 and 6-inch howitzers, then used this and other information to “definitively” explain the counter-intuitive anti-Doppler sound of the shells’ whistling. He also identified the audible echo effect – the curious “double report” of the guns heard here – as the sound of a brass recording horn violently resonating at a distance of exactly 26.5 meters from the guns.
Eventually, skepticism won out. Close listening at slow speeds – just careful attention and notation, nothing more elaborate – revealed inconsistencies and oddities in the firing noises. The bongs, plops and whistles seemed internally inconsistent. Some of the artillery sounds – ostensibly a battery of four, firing in quick succession – varied implausibly with each successive firing. Physical evidence from the record’s groove, as well as extraneous noises – surface crackle and fizz, and, audible within the recording, the swish of a turntable – seemed to indicate at least two rudimentary overdubs, in which the output of one acoustic horn was relayed into a second, possibly using an auxetophone, an early compressed-air amplifier. All this resulted in a double- or triple-layered sonic artifact. Finally – the crucial evidence, although oddly it was hardly noticed at the time – an alternative take was located. In this take, according to its discoverer, the entire theatrics of gunnery command is simply absent, and there is no sound at all of whistling shells in motion. What was left was a skeleton sequence of clicks, thuds and cracks, supplemented with only a single closing insert, the portentous injunction “Feed the Guns with War Bonds!”
In short, it seems highly likely that any original field recording was, at the very least, post-dramatized with performed voices and percussive and whistling sound effects. So, it is tempting to say, that clears that up. The recording’s inauthenticity is proven. File under Fake. But in fact, if we don’t stop there, if we set aside narrow and absolutist ideas of authenticity, and instead explore the recording’s ambiguity and hybridity, then Gas Shell Bombardment becomes all the more interesting as an historical artifact.
Let’s assume, for the sake of argument, that some form of basic recording was done in France, very possibly a staged barrage specifically performed for Gaisberg’s visit, and that this recording then had effects added back at HMV in London. The record might then be seen less as a straightforward documentary, and instead as an unusual version of the “descriptive speciality,” a genre of miniature phonographic vignette dating back to the 1890s, far predating longer-form radio drama. Very little is known about these early media artworks, but it is a fair generalization to say that in America the genre was more slanted towards vaudeville comedy, whereas in Europe, imperial and military scenes predominated. As early as 1890, for example, there had been German phonographic representations of battles from the Franco-Prussian war. The Great War saw a flourishing of the genre. Scholars are just beginning to take an interest in these old phonographs; here’s one recent essay on the “Angel of Mons,” for example, a British acoustic vignette of a famous incident on the Western Front.
Listen to a 1915 German descriptive speciality, depicting the attack on the fortress of Liège the previous year:
As a descriptive speciality, Gas Shell Bombardment is unusual because it incorporates an actual indexical trace. But such traces – as emphasized by Charles Sanders Peirce and many later media theoreticians – do not resemble their referent; they are caused by it. The bullet hole does not look much like a bullet; thunder is lightning’s trace, not its likeness. But for Street and Gaisberg, the trace’s lack of resemblance caused problems: the original recording’s lack of detail, cues and clues, but above all its lack of internal dimensionality, created a perceptual shortfall and a lack of credibility. Maybe they hoped that the guns, by sheer force of amplitude, would overcome the spatially impoverished, reverbless reproduction of pre-electric recording. If so, it didn’t work. Without added effects, the guns’ trace was as flat and “body-less” as a sequence of Morse. It was a sound without a scene. The producers’ interventions aimed to thicken the primary artifact with referential-sounding detail, but also to heighten the sense of materiality and spatiality, and to strengthen the sense of diegetic presence, of worlded thereness. The soldiers’ voices – louder and quieter, close-up and farther-out – and the fake Doppler of the “shell whistling” lent the recording narrative direction (literally, some trajectory) and “authenticating” points of detail. But above all they gave a sense of internal space to the recording, a space into which the listener could direct her attention.
In this context, we can only admire the creativity and performative élan of the unknown production crew. We know little about effects production in early phonography. It is a safe bet that some techniques were adopted from theatre, and that there was overlap with silent film accompaniment. But whatever the method used, it would have called for the awkward orchestration of a limited number of iconic sounds to create an impression of a spatially coherent and materially detailed sonic environment. The recordist and his team would first have had to imagine how relative loudness – of voices, of material objects struck and sounded – might create a sense of spatial depth when transduced through the horn’s crude interface. Then they would have had to perform this as a live overdub, keeping time with the base track of the gun recording played through another horn. And all this done with participants and equipment crowded tightly around the mouth of the huge horn, crammed into the tiny pick-up arc, a scene looking something like this image of Leopold Stokowski’s pre-electric recording sessions or this photograph of the recording of a cello concerto.
As well as this hybrid of trace and live performance, there is another performance here – Gaisberg’s journey itself. With twenty years of recording experience, Gaisberg was probably very well aware that the expedition would not yield a “realistic” recording of the guns. But the expedition had to be made, so that it could be said to have taken place. Expectations had to be primed and colored, so that, to use André Bazin’s famous phrase about photographs, the recording could partake in an “irrational power to… bear the belief” of the listener. The journey, and the accounts of Gaisberg and Street are not a supplement to the “true representation” of the gas bombardment. They are part of that representation. Moreover, in subsequent writing it is noticeable that the manner of Gaisberg’s death becomes a rhetorical amplification for the authenticity of the recording’s trace, as if his fatal inhalation (of gas molecules or flu bacilli) were itself a deadly indexation, paralleling the recording’s claim to capture the breath of the War, and even of History itself.
In media-historical terms, the Gas Shell Bombardment recording can be understood as a late, transitional artifact from phonography’s pre-microphonic era. The desire for the sonic trace, for an ever more immersive proximity to events was there, but electro-acoustic technology was not yet in place. Two years later, in 1920, Horace Merriman and Lionel Guest made the first experimental electrical recording, arguably also the first true field recording. The event, appropriately enough, was an official war memorial service in London, where Merriman and Guest – working for Columbia Records – put microphones in Westminster Abbey, running cables to a remote recording van parked in the street outside, where they sat amidst heating ovens and cutting lathes. By the end of the 1920s, remote recording and broadcasting, while never straightforward, were well on the way to ubiquity.
Claims made on behalf of technologies of reproduction may seem simplistic, but there’s a grain of truth to their simplicity. If there were nothing special – even magical – in the referentiality of the camera that captures the moment, the recording that’s like being there, the liveness of the live broadcast, these things would not play the role they do in everyday life and in the ideological fabric of society. But there is falsehood too, in over-simplifying the nature and affective charge of old photographs, old footage, old recordings. These are made things, composed of different materials, media, signs and conventions; they are inseparable from the desires and expectations they induce and direct. They function in part by mimesis and verisimilitude, but also through the gaps, blank spots and false illusions of their trace. They can – rightly – intensify our feeling towards the past, but should also prompt us to think about our own desires and investments.
Image by Flickr User DrakeGoodman, “Horchposten im Spengtrichter vor Neuve-Chapelle 6km nördlich von La Bassée Nordfrankreich 1916,” A trio of lightly equipped soldiers from an unidentified formation oblige the photographer by looking serious and pretending they’re just metres from the enemy, listening for activity in his lines. The improvised “listening device” is actually a large funnel, probably liberated from a nearby farm.
Brían Hanrahan is a film, media and cultural historian, whose work focuses on the history of acoustic media, German and European cinema and the culture of the Weimar Republic.
Edited post-publication at 8:00 pm EST on July 7, 2014
REWIND!…If you liked this post, you may also dig:
A Brief History of Auto-Tune–Owen Marshall
DIY Histories: Podcasting the Past–Andrew Salvati
Last month, T.M. Luhrmann compared the experience of reading a written book versus listening to books in the New York Times article “Audiobooks and the Return of Storytelling.” Luhrmann points out how audiobook sales jumped 20% in 2012, whereas total industry book sales went down 1%. From the looks of it, books have benefited from audiobook sales, but in literary studies, print remains the primary vehicle for analysis. Might listening to an audiobook actually change how we critically read a text?
As I listened to Junot Díaz narrate This Is How You Lose Her (2012), the first book Díaz has read as an audiobook and the first book of short stories the author has published since 1996’s Drown, I wondered how his reading influenced how I interpreted the text. Díaz’s reading sounds less like regular speech and more like a performance, with its own cadence and rhythm:
This post approaches the audiobook as a text in itself, coming from a sound studies perspective. I attempt to conceptualize the idea of “close listening” as a methodology akin to “close reading” in literary studies. I listen for how Díaz reads the text, but more specifically for how the reading itself becomes a way of authoring the text. Ultimately, I argue that Díaz’s reading becomes a re-authoring of the text—a re-writing of the text sonically. On a broader level, I hope to add to the conversation about what it means to read an audiobook, as Birgitte Stougaard Pedersen and Iben Have brought up in “Conceptualising the Audiobook Experience.” Using This Is How You Lose Her, I show that reading an audiobook means engaging with the text from the angle of the ear, and that close listening can become an aural reading practice that relies not so much on the visual text as on aural cues from the narrator.
This Is How You Lose Her revolves around Yunior, a young Dominican immigrant who grows up in New Jersey and who ends up as a professor in Boston, and the many loves he has had or that he has encountered growing up. The stories trace his progress from a young, recently arrived Yunior, to a tenured, mature Yunior, showcasing certain relationships that influence how he relates to women—in sum, illustrating how he loses the women he loves. Throughout the short story collection, Díaz also calls attention to other relationships that may influence Yunior’s perspective, for example, his brother’s attachments with women, especially toward the end of his young life as he battled cancer, and his father’s relationship with his mistress, a Dominican woman who lived in New Jersey. At the end, Díaz illuminates how a mujeriego (womanizer) like Yunior comes to be; the short stories indicate that Yunior is as much a product of his environment as he is a seller of the merchandise.
Díaz is not a professional audiobook narrator. Although he has done live readings, reading the full-length version of a book one has written is a different exercise. The Penguin Audio version of the collection is based on the actual short story collection (in other words, unabridged), so it does not contain additional stories or behind-the-scenes interviews. Technically, it is no different from the print version.
Listening to authors read their own work has value beyond the pleasure of hearing them read their text. Scholarly writing on audiobooks has emphasized the experience of listening to an audiobook for pleasure (like Deborah Phillips' "Talking Books: The Encounter of Literature and Technology in the Audiobook" and James Shokoff's "What Is An Audiobook?"), but it wasn't until the 2011 edited collection Audiobooks, Literature, and Sound Studies that audiobooks were considered on their own terms instead of as extensions of the literature they were based on. The allure of doing this scholarly exercise with the audiobook version of This Is How You Lose Her is that Díaz's delivery of the text is uncommon, to say the least.
Talking about Junot Díaz's readerly voice requires tuning into conversations about his writerly voice. In many reviews of Díaz's books, writers discuss how Díaz deftly conveys a writer's voice in his text, indicating that his success lies in his characters having a very clear voice—or at least Yunior does. Michiko Kakutani, for example, points out how "Junot Díaz has one of the most distinctive and magnetic voices in contemporary fiction: limber, streetwise, caffeinated and wonderfully eclectic, capable of conjuring for the reader everything from the sorrows of Dominican history to the banalities of life in New Jersey." Although this quotation refers to Díaz's second book, The Brief Wondrous Life of Oscar Wao, it describes Díaz's writing in terms of his voice instead of, for instance, in terms of his use of metaphors or choice of subject.
Richard Wolinsky, in his Guernica interview with Díaz, sees an overlap between Yunior and Díaz: "He's [Yunior] got a very distinct voice, and it's a voice that's informed by [Díaz's] own reading, particularly science fiction and fantasy." Although Díaz has pointed out that Yunior is loosely based on events that have happened to him, Wolinsky "hears" Díaz in his main character. The tone and the language Yunior uses are read as a reflection of Díaz.
Conversations about the voice of the writer point to a sensibility about sound, but are often limited to a written text. Anna Barnet, in an interview with Junot Díaz, states “His two principal linguistic registers (‘this kind of crazy Caribbean language and music’ and ‘this sort of African-American-infused American vernacular’) grind against each other along with the many other voices he ventriloquizes in his writing.” Barnet reminds readers that Díaz’s writing style is based in spoken language—particularly Díaz’s spoken language. This language of “voice” to describe a writer’s style (or, specifically, a writer’s ability to convey a clear sense of who the character is and/or their views) is commonplace but gives the impression that there is a sonic aspect to an author’s work, when in reality it is but a metaphor for something that occurs at the level of text.
A critical reading of a text that includes the audiobook rendition allows critics to add substance to those references to "voice." In Junot Díaz's case, it is possible that readers encounter him first through written text, and so have an expectation of what Díaz (or Yunior) would sound like live. In my textual analysis of eight audiobook reviews (and one book review that included a mention of the narration in the audiobook), most listeners showed some sort of discomfort with Díaz's narration. One reviewer, for example, took issue with the "smoothness" of Díaz's narration: "At times the reading was a little shaky and uneven." Another reviewer stated that "at times his cadence is choppy, with odd pauses and emphasis on strange words that detract from the overall experience." Reviewers also took issue with Díaz's pace, which is characterized by pauses in places that may not seem normal in casual American speech. These statements hint at a "weird" quality in Díaz's speech, something that does not come through when Díaz has a casual conversation. (Listen to this podcast episode of NPR's Alt. Latino guest-starring Díaz and compare it with this video of him reading part of This Is How You Lose Her.) Although one blogger pointed out that Díaz sounded "professorial" in the reading, others used the words "native," "authenticity," "Dominican," and even "Jersey accent" to describe how Díaz sounded. It is unclear how these reviewers define "native" or "authentic."
Connecting sound to authenticity implies that Dominicans can only sound a certain way, or that the audio narration is lacking when it does not represent a "typical" Dominican voice. Díaz is Dominican, but his is the voice of a Dominican male who has grown up in the Northeastern United States. His uneven audio narration creates a feeling of sonic unintelligibility in the listener, similar to the effect of including Spanish words in the written text. Díaz-as-narrator can make a listener uncomfortable, and by extension forces that reader to listen.
The sonic unintelligibility also relies on the text, on how Díaz plays with language by switching back and forth from English to Spanish. Díaz mentions in an interview with Marva Hinton that some readers are not happy with his choice of Spanglish in his writing: "There [are] folks who hear one Spanish word, and they're convinced this is some sort of immigrant conspiracy." Further down in the same article, Díaz refers to his mix of Spanish and English (and a particular kind of Spanish and English at that, since he moves among Standard American English, African American Vernacular English, and Dominican Spanish) as "opaque language." There's a connection between the kind of "opaqueness" that Spanish gives the text and the unintelligible effect of hearing Díaz read his work.
An example of how sonic unintelligibility operates in the audiobook is the first story, "The Sun, The Moon, The Stars." This opener, told in the first person, revolves around one of Yunior's break-ups: Yunior and his girlfriend Magdalena, on whom he cheated, go to the Dominican Republic on a trip they had planned before she found out about the affair. It frames the book as an in-depth analysis of loves lost, from the man who keeps losing them. It also sets the tone sonically for the audiobook: after the introduction of the book, a snippet of bachata music comes on and then makes way for Díaz, who reads the title of the story. This is the pattern of the book: slices of bachata, followed by Díaz's narration.
His voice is characterized by a slight sing-song cadence reminiscent of a Dominican Spanish accent. If this were in Spanish, it might be easier to lose track of the cadence, but in English it sounds like a disembodied accent. I showcase the swing in Díaz's narration by alternating capital and lower-case letters: "Her FAther, who usually would treat me like his HIjo, CALLS me an ASShole on the PHONE, SOUNDS like he's STRANgling himself with the cord." The voice seems to float for a while until Díaz arrives at the end of a paragraph or a series of sentences, and then it sinks. Moreover, this pattern does not change when Díaz switches characters: it's hard to tell Yunior apart from Magdalena unless the reader pays close attention to when the narrator is switching characters and/or uses a pronoun. The same effect comes from the odd pauses in the author's narration: "Oh God, she wailed. Oh. My God."
The choppiness and the emphasis in the reading dislocate the listener, much as the Spanish phrases or the lack of quotation marks in the text dislocate a reader who does not understand Spanish or who depends on quotation marks to make sense of the prose. This story also focuses on Magdalena withdrawing from Yunior and not communicating with him. The tone, cadence, and sound of Díaz's voice can thus be read as mirroring the relationship between Yunior and Magdalena (and the other women in the text): the sonic unintelligibility is manifest at the level of plot through Yunior's relationships.
Although many audiobook reviewers may consider the plot in their reviews, part of what makes an audiobook stand out is the performance of the text. I take my cues from audiobook reviewers and consider critically my listening experience of This Is How You Lose Her and how it can become the basis for a critical interpretation of the text. My analysis underscores that having an author read a text can provide a different way into analyzing it and prompts readers to pay attention to sound. If, as Shokoff asserts, most audiobook readers listen to an audiobook while doing something else, Díaz shows that listening closely to the audio text can be as rewarding as reading a book.
Featured Image: “Junot Diaz” by WBUR Boston’s NPR News Station, Attribution-NonCommercial-NoDerivs License
Liana Silva-Ford is co-founder and Managing Editor of Sounding Out!.
REWIND! . . .If you liked this post, you may also dig:
Editor’s Note: Even though this is officially Osvaldo Oyola‘s final post as an SO! regular–his brilliant dissertation on Latino/a identity and collection cultures is calling–I refuse to say goodbye, perpetually leaving the door open for future encores. He has been a bold and steadfast contributor–peep his extensive back catalogue here–and we cannot thank him enough for bringing such a whipsmart presence to Sounding Out! over the years. Best of luck, OOO, our lighters are up for you!–J. Stoever-Ackerman, Editor-in-Chief
As several of my previous Sounding Out! blog posts reveal, I am intrigued by the way popular music seeks to establish its authenticity to the listener. It seems that recorded popular music seeks out ways to overcome its lack of presence as compared to a live performance, where a unified and spontaneous sense of immediacy seems to automatically bestow the aura of the "authentic"—a uniqueness that, ironically, live performance's very unrepeatability engenders. Throughout my time as a Sounding Out! regular, I have explored how authenticity may be conferred through artists affecting an accent as a form of musical style, comparing their songs to other "less authentic" forms of music through a call to nostalgia, or even by highlighting artificiality through the use of auto-tune.
One of the ways that artists and producers get past a potential lack of authenticity when recording is through call outs to "liveness." I am not referring to concert recordings (though there are ways that they can be used), but to elements like counting off at the beginning of songs or introducing some change or movement in a song. There is no practical need to count off "One, two, three, four!" at the beginning of a recording if the song is being pieced together through multiple tracks and overdubs. These days a "click track" or post-recording adjustment can keep all the players in time even if they are not necessarily playing at once; even if a song is being recorded as a kind of studio jam, the count off could be edited out. It is an artifact of the creation, not a sign of creation itself. Instead, the counting can become an accepted and notable part of the song, as in Sam the Sham and the Pharaohs' "Wooly Bully," giving it an orientation in time—the sense that all these musicians were present together, playing their instruments at once, and needed this unique introduction to keep them all in tempo.
Similarly, sometimes artists call out to other musicians, giving instructions when no instructions are needed, given that most popular music is recorded in multiple takes using multiple tracks. In Parade's "Mountains," Prince commands the Revolution, "guitars and drums on the one!" though clearly they had rehearsed while putting the song together and ostensibly knew when the drum and guitar breakdown was coming. Prince, furthermore, joins artists as varied as the Grateful Dead and the Beastie Boys in mixing concert recordings with studio overdubs to capture a "live" sound on songs like "It's Gonna Be a Beautiful Night" and "Alligator." Even something as ubiquitous as guitar feedback is a transformation of an artifact of live performance into a sound available for use in recording—something that was purposefully avoided until John Lennon's happy accident in the studio while cutting "I Feel Fine." Until then, playing with feedback was a way to demonstrate performance skills through onstage vamping.
These varied calls to liveness provide a sense of authenticity to music made via the recording studio, denoting what I understand as the spontaneous sociability of music. Count-offs and studio shout-outs provide a sense of unified presence to a performance, especially if the performance has actually been constructed piecemeal and over time. This is something of a remnant of an old-fashioned notion that recorded music is measured in quality against live performance. It's an idea that hung around, both implicitly and explicitly, long after bands started experimenting in the studio with effects that ranged from the difficult to the impossible to replicate on stage, and it was reinforced through recordings by performers who purposefully referenced their lauded live performances.
For example, James Brown's "Get Up (I Feel Like Being a) Sex Machine" is built on this conceit. The entire song is a conversation, a call and response between James Brown and his band, the J.B.'s. From the opening line, Brown introduces the song as a moment in time in which he is compelled to do his thing, but he demands both encouragement and cooperation from the band in order to achieve it. When Brown asks Bobby Byrd, "Bobby! Should I take 'em to the bridge?" we as listeners are invited to play along with the idea that it has suddenly come into his head to have the band play the bridge—as it might have happened (and thus been practiced) countless times in his legendary live shows. It suggests a form of spontaneity that the reality of recording would otherwise drain from the song. Sure, according to RJ Smith's The One: The Life and Music of James Brown (2012), "Get Up" was recorded in only two takes–already fairly amazing–but the very nature of the song makes it sound like it was recorded in one, even if it had to be broken up into two sides of a 7-inch. That reality doesn't matter—what matters when listening is the feeling that we, as listeners, are being allowed to partake in the capturing of what seems like one unique, and continuous, moment.
The question then arises: What about recorded music that does the opposite, that makes a point of highlighting its artificial construct—the impossibility of its spontaneous performance? While there are examples that date back at least to the 1960s, does this shift highlight a difference in aesthetic concerns by the pop music audience? If calls to “liveness” suggest a spontaneous sociability to music, what do the meta references to their songcraft suggest about what is important to music now?
The classic example is Ringo Starr's bellow, "I GOT BLISTERS ON MY FINGERS!" at the end of the Beatles' "Helter Skelter," an exclamation made after umpteen takes of the song recorded on the same day, but there are more contemporary and even more obvious examples. Near the end of Outkast's "Prototype" (at 4:21), Andre 3000 can be heard talking to his sound engineer John Frye about the ad libs: "Hey, hey John! Are we recording our ad libs? Really? Were we recording just then? Let me hear that, that first one. . ." There is an interesting tension here between the spontaneity of an "ad lib" and listening back to pick the best one or further develop it when re-recording, and Andre, in his role as producer, decided to keep it in as part of the final product. The recording itself becomes part of the subject of the song, a kind of coda. The banter is actually a brilliant parallel to the content of the song, which undermines the typical "we'll be together forever" love song trope for one that highlights the reality of serial monogamy common in American culture and the lessons each relationship potentially provides us for the next. Rather than pretend that a romantic relationship is a unique and eternal thing, the song admits the work and changes involved, just as it admits that the seemingly special spontaneity of a song is developed through a process.
Of course, hip hop as a genre, with its frequent use of sampling, tends to make its recording process very evident. While it is possible to play samples "live" using a digital sampler or by isolating sections on vinyl via the DJ as band member, the use of pre-recorded fragments means that rap music relies on the vocal dynamics of rapping to carry the sense of spontaneity. Yet, in 1993's "Higher Level," KRS-One opens with a description of the time and place of the recording—"5 o'clock in the morning" at "D&D Studios"—establishing forever when and where, and thus how, the recording is happening. Five o'clock in the morning places the creation of the song within a context of working and rocking all through the night to get the album completed. The song may or may not have actually been recorded last, but its placement at the end of Return of the Boom Bap gives it the sense of a last-ditch effort to complete the collection of songs. The fact that "5 o'clock in the morning" is likely also among the cheapest available studio times potentially highlights budgetary concerns in the recording itself. This is a rare thing to include in a recording, though the Brand New Heavies cap off the dissolution of their 1994 track "Fake" into pseudo-jazz-messing-around with one of their members chiding, "a thousand dollar a day studio!" This is a different kind of call to authenticity, since a budgetary concern is implicit to a "realness" defined by being non-commercial.
One of my all-time favorite examples is a few years older than "Higher Level"—"Nervous" by Boogie Down Productions: "written, produced and directed by Blastmaster KRS-One," which includes an attempt to explain how a song is put together on the "48-track board." Instead of calling instructions to a band, KRS points out that DJ Doc is doing the mixing and instructs him to "break it down, Doc!" just before a beat breakdown (listen at around 1:40). He explains, "Now, here's what we do on the 48-track board / We look around for the best possible break / And once we find it, we just BREAK," and then the pre-recorded beat seems to obey his command, breaking down to just the bass drum and a sampled electric piano from Rhythm Heritage's "The Sky's the Limit." Later, he says, "We find track seven, and break it down!" and the music shifts to just the bass guitar and some tinny synth hi-hats.
So how does highlighting the recording circumstances, or just calling attention to the fact that the song being listened to is the product of a multiple-step process of recording and post-production, benefit the song itself? Is it, as I suggested in my 2011 "Defense of Auto-Tune" post, that this kind of attention re-establishes authenticity by making the song's constructed nature transparent? I'd say yes, in part, but I also think that–through its violation of the expectation of seamlessness–the stray track or reference to recording within a song is a nod to a different kind of skillfulness. Exhortations such as "Take it to the Bridge" give an ironic nod to the extemporaneous to call attention to the diligent workmanship and dedication demanded by studio songcraft. Traditionally, live audiences may appreciate a flawless or nearly flawless performance and understand a masterful recovery from (and/or incorporation of) error as the sign of a good show, but these moments that call attention to the recording studio situation claim there is something to appreciate in the fact that Ringo Starr endured 18 takes of "Helter Skelter" until he had painful blisters, or that KRS-One and DJ Doc worked out the proper way to "feel around" the mixing board to make a grooving collage of sounds as disparate as the theme from "Rat Patrol" and WAR's "Galaxy."
KRS may have once admonished other MCs to "make sure live you is a dope rhyme-sayer," but clearly he believes liveness—whether implicit or explicit—is not the only measure of musical ability. Rather, the highlighting of labor in the construction of a recording becomes its own kind of (anti-)vamping and demonstration of skill, and of a different kind of sociability in making music, one that these conversational snippets and references to other people in the studio make clear. This kind of attention to group labor is especially important as various recording technologies become increasingly available to the wider public and allow for an isolated pursuit of recording music. Just as calls to liveness in recording engage the listener in ways that suggest participation as a live audience, calls to anti-liveness also engage the listener, but by bringing them across time and space into the studio to witness a different form of great performance.
Osvaldo Oyola is a regular contributor to Sounding Out! and a PhD Candidate in English at Binghamton University working on his dissertation, “Collecting Identity: Popular Culture and Narratives of Afro-Latin Self in Transnational America.” He also regularly posts brief(ish) thoughts on music and comics on his blog, The Middle Spaces.
REWIND! . . .If you liked this post, you may also dig:
Experiments in Agent-based Sonic Composition–Andreas Pape