Multidisciplinary Column: An Interview with Emilia Gómez

Could you tell us a bit about your background, and what the road to your current position was?

I have a technical background in engineering (telecommunication engineer specialized in signal processing, PhD in Computer Science), but I also pursued formal musical studies at the conservatory from childhood. So I think I have an interdisciplinary background.

Could you tell us a bit more about how you have encountered multidisciplinarity and interdisciplinarity both in your work on music information retrieval and your current project on human behavior and machine intelligence?

Music Information Retrieval (MIR) is itself a multidisciplinary research area intended to help humans make better sense of music data. MIR draws from a diverse set of disciplines, including, but by no means limited to, music theory, computer science, psychology, neuroscience, library science, electrical engineering, and machine learning.

In my current project HUMAINT at the Joint Research Centre of the European Commission, we try to understand the impact that algorithms will have on humans, including our decision making and cognitive capabilities. This challenging topic can only be addressed in a holistic way and by incorporating insights from different disciplines. At our kick-off workshop we gathered researchers working in distant fields, from computer science to philosophy, including law, neuroscience and psychology, and we realised the need to engage in scientific discussions from different views and perspectives to address human challenges in a holistic way.

What have, in your personal experience, been the main advantages of multidisciplinarity and interdisciplinarity? Have you also encountered any disadvantages or obstacles?

The main advantage I see is that we can combine distinct methodologies to generate new insights. For researchers, stepping out of a discipline’s comfort zone makes us more creative and innovative.

One disadvantage is the fact that when you work on a multidisciplinary field you seem not to fit into traditional academic standards. In my case, I am perceived as a musician by engineers and as an engineer by musicians.

Beyond the academic community, your work also closely connects to interests by diverse types of stakeholders (e.g. industry, policy-makers). In your opinion, what are the most challenging aspects for an academic to operate in such a diverse stakeholder environment?

The most challenging part of diverse teams is communication, e.g. being able to speak the same language (we might need to create interdisciplinary glossaries!) and to explain our research in an accessible way, so that it is understood by people with diverse backgrounds and areas of expertise.

Regarding your work on music, you often have been speaking about making all music accessible to everyone. What do you consider the grand research challenges regarding this mission?

Many MIR researchers desire that technology can be used to make all music accessible to everyone, i.e. that our algorithms can help people discover new music, develop a varied musical taste and become open to new music and, at the same time, to new ideas and cultures. We often talk of our desire that MIR algorithms help people discover music in the so-called “long tail”, i.e. music that is not so popular or present in the mainstream. I believe the variety of music styles reflects the variety of human beings, e.g. in terms of culture, personalities and ideas. Through music we can then enrich our culture and understanding.

As the newly elected president of the ISMIR society, are there any specific missions regarding the community you would like to emphasize?

I have had the chance to work with an amazing ISMIR board over the last years, an incredible group of people willing to contribute to our community with their talent and time. With this team it is very easy to work!

This year, ISMIR is organizing its 19th edition (yes, we are getting old)! There are many challenges at ISMIR that we as a community should address, but at the moment I would like to emphasize some relevant aspects that are now somehow a priority for the board.

The first one is to maintain and expand its scientific excellence, as ISMIR should continue to provide key scientific advancements in our field. In this respect, we have recently launched our open access journal, Transactions of ISMIR, to foster the publication of deeper and more mature research works in our area.

The second one is to promote variety in our community, e.g. in terms of discipline, gender or geographical location, also related to music culture and repertoire. In this respect, and thanks to our members, we have promoted ISMIR taking place at different locations, including editions in Asia (e.g. 2014 in Taipei, Taiwan, and 2017 in Suzhou, China).

Other aspects we value are reproducibility, openness and accessibility. In this sense, our priority is to maintain affordable registration rates, taking advantage of sponsorships from our industrial members, and to devote our membership fees to providing travel funds for students or other members in need to attend ISMIR.

How and in what form do you feel we as academics can be most impactful?

The academic environment gives you a lot of flexibility and freedom to define research roadmaps, although there are always some dependencies on funding. In addition, academia provides time to reflect and go deep into problems that are not directly related to a product in the short term. In the technological field, academia has the potential to advance technologies by focusing on a deeper understanding of why these technologies work well or not, e.g. through theoretical analysis or comprehensive evaluation.

You also have been very engaged in missions surrounding Women in STEM, for example through the Women in MIR initiatives. In discussions on fostering diversity, the importance of role models is frequently mentioned. How can we be good role models?

Yes, I have become more and more concerned about the lack of opportunities that women have in our field with respect to their male colleagues. In this sense, Women in MIR is playing a major role in promoting the role and opportunities of women in our field, including a mentoring program, funding for women to attend ISMIR, and the creation of a public repository of female researchers to make them more visible and present.

I think women are already great role models in their different profiles, but they lack visibility with respect to their male colleagues.


Dr. Emilia Gómez graduated as a Telecommunication Engineer at Universidad de Sevilla and studied piano performance at the Seville Conservatoire of Music, Spain. She then received a DEA in Acoustics, Signal Processing and Computer Science applied to Music at IRCAM, Paris, and a PhD in Computer Science at Universitat Pompeu Fabra in Barcelona (2006). She has been a visiting researcher at the Royal Institute of Technology, Stockholm (Marie Curie Fellow, 2003), McGill University, Montreal (AGAUR competitive fellowship, 2010), and Queen Mary University of London (José de Castillejos competitive fellowship, 2015). After her PhD, she was first a lecturer in Sonology at the Higher School of Music of Catalonia and then joined the Music Technology Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra in Barcelona, Spain, first as an assistant professor and then as an associate professor (2011) and ICREA Academia fellow (2015). In 2017, she became the first female president of the International Society for Music Information Retrieval, and in January 2018, she joined the Joint Research Centre of the European Commission as Lead Scientist of the HUMAINT project, studying the impact of machine intelligence on human behavior.

Editor Biographies

Dr. Cynthia C. S. Liem is an Assistant Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. She initiated and co-coordinated the European research project PHENICX (2013-2016), focusing on technological enrichment of symphonic concert recordings with partners such as the Royal Concertgebouw Orchestra. Her research interests consider music and multimedia search and recommendation, and increasingly shift towards making people discover new interests and content which would not trivially be retrieved. Beyond her academic activities, Cynthia gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, and a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach.



Dr. Jochen Huber is a Senior User Experience Researcher at Synaptics. Previously, he was an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and the ACM International Conference on Interactive Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage:

Socially significant music events

Social media sharing platforms (e.g., YouTube, Flickr, Instagram, and SoundCloud) have revolutionized how users access multimedia content online. Most of these platforms provide a variety of ways for the user to interact with the different types of media: images, video, music. In addition to watching or listening to the media content, users can also engage with content in different ways, e.g., like, share, tag, or comment. Social media sharing platforms have become an important resource for scientific researchers, who aim to develop new indexing and retrieval algorithms that can improve users’ access to multimedia content and, as a result, enhance the experience provided by social media sharing platforms.

Historically, the multimedia research community has focused on developing multimedia analysis algorithms that combine visual and text modalities. Less visible is research devoted to algorithms that exploit an audio signal as the main modality. Recently, awareness of the importance of audio has experienced a resurgence. Particularly notable is Google’s release of AudioSet, “A large-scale dataset of manually annotated audio events” [7]. In a similar spirit, we have developed the “Socially Significant Music Event” dataset that supports research on music events [3]. The dataset contains Electronic Dance Music (EDM) tracks with a Creative Commons license that have been collected from SoundCloud. Using this dataset, one can build machine learning algorithms to detect specific events in a given music track.

What are socially significant music events? Within a music track, listeners are able to identify certain acoustic patterns as nameable music events.  We call a music event “socially significant” if it is popular in social media circles, implying that it is readily identifiable and an important part of how listeners experience a certain music track or music genre. For example, listeners might talk about these events in their comments, suggesting that these events are important for the listeners (Figure 1).

Traditional music event detection has only tackled low-level events like music onsets [4] or music auto-tagging [8, 10]. In our dataset, we consider events that are at a higher abstraction level than low-level musical onsets. In auto-tagging, descriptive tags are associated with 10-second music segments. These tags generally fall into three categories: musical instruments (guitar, drums, etc.), musical genres (pop, electronic, etc.) and mood-based tags (serene, intense, etc.). These types of tags are different from what we are detecting as part of this dataset. The events in our dataset have a particular temporal structure, unlike the categories that are the target of auto-tagging. Additionally, we analyze the entire music track and detect start points of music events, rather than classifying short segments as in auto-tagging.

There are three music events in our Socially Significant Music Event dataset: Drop, Build, and Break. These events can be considered to form the basic set of events used by EDM producers [1, 2]. They have a certain temporal structure internal to themselves, which can be of varying complexity. Their social significance is visible from the presence of a large number of timed comments related to these events on SoundCloud (Figures 1 and 2). The three events are popular in social media circles, with listeners often mentioning them in comments. Here, we define these events [2]:

  1. Drop: A point in the EDM track, where the full bassline is re-introduced and generally follows a recognizable build section
  2. Build: A section in the EDM track, where the intensity continuously increases and generally climaxes towards a drop
  3. Break: A section in an EDM track with a significantly thinner texture, usually marked by the removal of the bass drum

Figure 1. Screenshot from SoundCloud showing a list of timed comments left by listeners on a music track [11].



SoundCloud is an online music sharing platform that allows users to record, upload, promote and share their self-created music. SoundCloud started out as a platform for amateur musicians, but currently many leading music labels are also represented. One of the interesting features of SoundCloud is that it allows “timed comments” on the music tracks. “Timed comments” are comments, left by listeners, associated with a particular time point in the music track. Our “Socially Significant Music Events” dataset is inspired by the potential usefulness of these timed comments as ground truth for training music event detectors. Figure 2 contains an example of a timed comment: “That intense buildup tho” (timestamp 00:46). We could potentially use this as a training label to detect a build, for example. In a similar way, listeners also mention the other events in their timed comments. So, these timed comments can serve as training labels to build machine learning algorithms to detect events.

Figure 2. Screenshot from SoundCloud indicating the useful information present in the timed comments. [11]


SoundCloud also provides a well-documented API [6] with interfaces to many programming languages: Python, Ruby, JavaScript etc. Through this API, one can download the music tracks (if allowed by the uploader), timed comments and also other metadata related to the track. We used this API to collect our dataset. Via the search functionality we searched for tracks uploaded during the year 2014 with a Creative Commons license, which results in a list of tracks with unique identification numbers. We looked at the timed comments of these tracks for the keywords: drop, break and build. We kept the tracks whose timed comments contained a reference to these keywords and discarded the other tracks.
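The keyword-filtering step described above can be sketched in a few lines of Python. This is an illustrative reimplementation, not the authors' actual collection code, and the data shapes (a track id mapped to a list of `(timestamp_ms, text)` timed-comment pairs, as one might assemble from the API responses) are our assumptions:

```python
# Keywords whose presence in a timed comment keeps a track in the dataset.
KEYWORDS = ("drop", "break", "build")

def has_event_comment(timed_comments):
    """Return True if any timed comment mentions an event keyword.

    timed_comments: list of (timestamp_ms, text) pairs, a hypothetical
    shape for the comments returned by the SoundCloud API.
    """
    return any(
        any(kw in text.lower() for kw in KEYWORDS)
        for _, text in timed_comments
    )

def filter_tracks(tracks):
    """Keep only tracks whose timed comments reference an event.

    tracks: dict mapping track id -> list of (timestamp_ms, text) pairs.
    Returns the list of retained track ids.
    """
    return [tid for tid, comments in tracks.items() if has_event_comment(comments)]
```

Matching is deliberately loose (substring, case-insensitive), so comments like “That intense buildup tho” also count as mentions of a build.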


The dataset contains 402 music tracks with an average duration of 4.9 minutes. Each track is accompanied by timed comments relating to Drop, Build, and Break. It is also accompanied by ground truth labels that mark the true locations of the three events within the tracks. The labels were created by a team of experts. Unlike many other publicly available music datasets that provide only metadata or short previews of music tracks  [9], we provide the entire track for research purposes. The download instructions for the dataset can be found here: [3]. All the music tracks in the dataset are distributed under the Creative Commons license. Some statistics of the dataset are provided in Table 1.  

Table 1. Statistics of the dataset: Number of events, Number of timed comments

Event Name | Total number of events | Number of events per track | Total number of timed comments | Number of timed comments per track
-----------|------------------------|----------------------------|--------------------------------|-----------------------------------
Drop       | 435                    | 1.08                       | 604                            | 1.50
Build      | 596                    | 1.48                       | 609                            | 1.51
Break      | 372                    | 0.92                       | 619                            | 1.54

The main purpose of the dataset is to support training of detectors for the three events of interest (Drop, Build, and Break) in a given music track. These three events can be considered a case study to prove that it is possible to detect socially significant musical events, opening the way for future work on an extended inventory of events. Additionally, the dataset can be used to understand the properties of timed comments related to music events. Specifically, timed comments can be used to reduce the need for manually acquired ground truth, which is expensive and difficult to obtain.

Timed comments present an interesting research challenge: temporal noise. The timed comments and the actual events do not always coincide. A comment could be at the same position as, before, or after the actual event. For example, in the music track below (Figure 3), there is a timed comment about a drop at 00:40, while the actual drop occurs only at 01:00. Because of this noisy nature, we cannot use the timed comments alone as ground truth. We need strategies to handle temporal noise in order to use timed comments for training [1].
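One simple strategy for coping with this temporal noise is to accept a timed comment as a (noisy) label only when an annotated event lies within a tolerance window around the comment. This is a minimal sketch of that idea, not the filtering used in [1]; the window bounds are illustrative assumptions:

```python
def match_comment_to_event(comment_t, event_times, window=(-15.0, 30.0)):
    """Match a timed comment to the nearest event within a tolerance window.

    comment_t:   comment timestamp in seconds.
    event_times: annotated event start times in seconds.
    window:      allowed offsets (event_time - comment_t); asymmetric,
                 since comments may either anticipate or trail an event.
                 These bounds are illustrative, not values from the paper.

    Returns the matched event time, or None if no event is close enough.
    """
    lo, hi = window
    candidates = [e for e in event_times if lo <= e - comment_t <= hi]
    if not candidates:
        return None
    return min(candidates, key=lambda e: abs(e - comment_t))
```

For the Figure 3 example, a drop comment at 00:40 would still be matched to the actual drop at 01:00, since the 20-second offset falls inside the window.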

Figure 3. Screenshot from SoundCloud indicating the noisy nature of timed comments [11].


In addition to music event detection, our “Socially Significant Music Event” dataset opens up other possibilities for research. Timed comments have the potential to improve users’ access to music and to support them in discovering new music. Specifically, timed comments mention aspects of music that are difficult to derive from the signal, and may be useful to calculate song-to-song similarity needed to improve music recommendation. The fact that the comments are related to a certain time point is important because it allows us to derive continuous information over time from a music track. Timed comments are potentially very helpful for supporting listeners in finding specific points of interest within a track, or deciding whether they want to listen to a track, since they allow users to jump-in and listen to specific moments, without listening to the track end-to-end.

State of the art

The detection of music events requires training classifiers that are able to generalize over the variability in the audio signal patterns corresponding to events. In Figure 4, we see that the build-drop combination has a characteristic pattern in the spectral representation of the music signal. The build is a sweep-like structure and is followed by the drop, which we indicate by a red vertical line. More details about the state-of-the-art features useful for music event detection and the strategies to filter the noisy timed comments can be found in our publication [1].

Figure 4. The spectral representation of the musical segment containing a drop. You can observe the sweeping structure indicating the buildup. The red vertical line is the drop.


The evaluation metric used to measure the performance of a music event detector should be chosen according to the user scenario for that detector. For example, if the music event detector is used for non-linear access (i.e., creating jump-in points along the playbar), it is important that the detected time point of the event falls before, rather than after, the actual event. In this case, we recommend using the “event anticipation distance” (ea_dist) as a metric. The ea_dist is the amount of time by which the predicted event time point precedes the actual event time point, and represents the time the user would have to wait to listen to the actual event. More details about ea_dist can be found in our paper [1].
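Under this definition, ea_dist can be computed directly. The handling of predictions that fall after the event is our assumption here; [1] should be consulted for the exact treatment:

```python
def ea_dist(predicted_t, actual_t):
    """Event anticipation distance, in seconds.

    The time a user positioned at the predicted jump-in point would wait
    until the actual event. Only defined when the prediction precedes the
    event; rejecting late predictions is an assumption of this sketch.
    """
    wait = actual_t - predicted_t
    if wait < 0:
        raise ValueError("prediction falls after the actual event")
    return wait
```

For example, a drop predicted at 00:42 against an actual drop at 01:00 gives an ea_dist of 18 seconds, the baseline figure reported below.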

In [1], we report the implementation of a baseline music event detector that uses only timed comments as training labels. This detector attains an ea_dist of 18 seconds for a drop. We point out that, from the user point of view, this level of performance could already lead to quite useful jump-in points. Note that the typical length of a build-drop combination is between 15 and 20 seconds. If the user is positioned 18 seconds before the drop, the build will already have started and the user knows that a drop is coming. Using an optimized combination of timed comments and manually acquired ground truth labels, we are able to achieve an ea_dist of 6 seconds.


Timed comments, on their own, can be used as training labels to train detectors for socially significant events. A detector trained on timed comments performs reasonably well in applications like non-linear access, where the listener wants to jump through different events in the music track without listening to it in its entirety. We hope that the dataset will encourage researchers to explore the usefulness of timed comments for all media. Additionally, we would like to point out that our work has demonstrated that the impact of temporal noise can be overcome and that the contribution of timed comments to video event detection is worth investigating further.


Should you have any inquiries or questions about the dataset, do not hesitate to contact us via email at:


[1] K. Yadati, M. Larson, C. Liem and A. Hanjalic, “Detecting Socially Significant Music Events using Temporally Noisy Labels,” in IEEE Transactions on Multimedia. 2018.

[2] M. Butler, Unlocking the Groove: Rhythm, Meter, and Musical Design in Electronic Dance Music, ser. Profiles in Popular Music. Indiana University Press, 2006.






[8] H. Y. Lo, J. C. Wang, H. M. Wang and S. D. Lin, “Cost-Sensitive Multi-Label Learning for Audio Tag Annotation and Retrieval,” in IEEE Transactions on Multimedia, vol. 13, no. 3, pp. 518-529, June 2011.




ACM Fellows in the SIGMM Community

Multimedia can be defined as the seamless integration of digital technologies in ways which provide for an enriched experience for users as we create and consume information with high fidelity.  Behind that definition lies a host of enabling digital technologies to allow us create, capture, store, analyse, index, locate, transmit and present information. But when did multimedia, as we now know it, start?  Was it the ideas of Vannevar Bush and Memex, or Ted Nelson and Xanadu or the development of Apple computers in the mid 1980s or maybe the emergence of the web which enables distribution of multimedia?

Certainly by the early 1990s, and definitely by 1993 when SIGMM was founded, multimedia was established and recognised as a mainstream activity within computing. Over the intervening two and a half decades we’ve seen tremendous progress, incredible developments and a wholesale adoption of our technologies right across our society. All this has been achieved partly on the backs of innovations by many eminent scientists and technologists who are leaders within our SIGMM community.

We recently saw two members of our SIGMM community elevated to the grade of ACM Fellow, joining the 52 other new ACM Fellows in the class of 2017. Our congratulations go to Yong Rui and to Shih-Fu Chang on their elevation to that grade. Yong gave a lovely interview for SIGMM on the significance of this honour for him as a researcher, and for us all in SIGMM, which is available at and it’s worth reflecting on some of the other members of our SIGMM family who have been similarly honoured in the past.

While checking SIGMM membership is an easy thing to do (though it’s a bit more difficult to check back throughout our membership history), it is a bit arbitrary to define who is and who is not part of our SIGMM “family”. To me, it’s somebody who is, or has been, an active participant in or organiser of our events, or a contributor to our field. Our SIGMM family includes those I would associate with SIGMM rather than any other SIG, and with ACM rather than with any other society.

In the class of new ACM Fellows for 2017 Shih-Fu Chang is elevated “for contributions to large-scale multimedia content recognition and multimedia information retrieval”. Shih-Fu is my predecessor as SIGMM chair and still serves on the SIGMM Executive as well as maintaining a hugely impressive research output.  He won the SIGMM Outstanding Technical Achievement Award in 2011. 

Yong Rui was also elevated to ACM Fellow in 2017 “for contributions to image, video and multimedia analysis, understanding and retrieval”.  Yong is a long-time supporter of SIGMM Conferences as well as a regular attendee and major contributor to our field.

Wen Gao of Peking University is vice president of the National Natural Science Foundation of China and was a co-chair of ACM Multimedia in 2009. He is also on the advisory board of ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) and was elevated to ACM Fellow in 2013 “for contributions to video technology, and for leadership to advance computing in China”.

Zhengyou Zhang was also elevated in 2013 “for contributions to computer vision and multimedia” and continues to serve the SIGMM community, most recently as Best Papers chair at MM 2017.

Klara Nahrstedt (class of 2012) was elevated “for contributions to quality-of-service management for distributed multimedia systems” and has served as SIGMM Chair prior to Shih-Fu. In 2014 Klara won the SIGMM Technical Achievement Award and until last year she also served on the SIGMM Executive Committee.

Joe Konstan (class of 2008) was elevated “for contributions to human-computer interaction” and he also won the ACM Software System Award in 2010. Joe was the ACM MM 2000 TPC Chair and was on the SIGMM Executive Committee from 1999 to 2007.

HongJiang Zhang (class of 2007) was elevated to Fellow “for contributions to content-based analysis and retrieval of multimedia”.  HongJiang also won the 2012 SIGMM Outstanding Technical Achievement Award and he has a huge publications output with a Google Scholar h-index of 120.

Ramesh Jain (class of 2003) was elevated “for contributions to computer vision and multimedia information systems”. Ramesh remains one of the most prolific authors in our field and a regular, almost omnipresent, attendee at our major SIGMM conferences. In 2010 Ramesh won the SIGMM Outstanding Technical Achievement Award.

Ralf Steinmetz (class of 2001) was elevated for “pioneering work in multimedia communications and education, including fundamental contributions in perceivable Quality of Service for multimedia systems derived from multimedia synchronization, and for multimedia education”. Ralf is also the winner of the inaugural  ACM SIGMM Technical Achievement Award, presented in 2008 and between 2009 and 2015 he served as Editor-in-Chief of the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), formerly known as TOMCCAP.

Larry Rowe (class of 1998) was elevated “for seminal contributions to programming languages, relational database technology, user interfaces and multimedia systems”.  Larry is a past chair of SIGMM (1998-2003) and in 2009 he received the SIGMM Technical Achievement Award.

P. Venkat Rangan was elevated to ACM Fellow in 1998. At the recent ACM MM Conference in 2017, we had a short presentation on the first ACM MM Conference in 1993, and Venkat’s efforts in organising that first MM were acknowledged in that presentation. Venkat’s ACM Fellowship citation says that he “founded one of the foremost centers for research in multimedia, in which area he is an inventor of fundamental techniques with global impact”.

What is interesting to note about these awardees is the broad range of areas in which their contributions are grounded, covering the “traditional” areas of multimedia. These range from quality of service delivery across networks to analysis of content and from user interfaces and interaction to progress in fundamental computer vision.  This reflects the broad range of research areas covered by the SIGMM community, which has been part of our DNA since SIGMM was founded.

Our ACM Fellows are a varied and talented group of individuals, each richly deserving of their award; their single unifying theme is broad multimedia, and that’s one of our distinguishing features. In some SIGs, like SIGARCH (computer architecture), SIGCSE (computer science education) or SIGIR (information retrieval), there’s a focus on a narrow topic or challenge, while in other SIGs, like SIGCHI (computer human interaction), SIGMOD (management of data) or SIGAI (artificial intelligence), there is a broad range of research areas. SIGMM sits with those areas where our application and impact is broad.

The ACM Fellow awards started nearly 25 years ago. Further details can be found at and a link to each of the awards can be found at

JPEG Column: 78th JPEG Meeting in Rio de Janeiro, Brazil

The JPEG Committee had its 78th meeting in Rio de Janeiro, Brazil. Relevant to its ongoing standardization efforts in JPEG Privacy and Security, JPEG organized a special session to explore how blockchain and distributed ledger technologies could be supported in past, ongoing and future standards of the JPEG family. This is motivated by the fact that, considering the potential impact of such technologies on the future of multimedia, standardization will be required to enable interoperability between different imaging systems and services relying on blockchain and distributed ledger technologies.

Blockchain and distributed ledger technologies are behind the well-known crypto-currencies. These technologies can provide means for content authorship, or intellectual property and rights management control of the multimedia information. New possibilities can be made available, namely support for tracking online use of copyrighted images and ownership of the digital content.


JPEG meeting session.

Rio de Janeiro JPEG meetings comprise mainly the following highlights:

  • JPEG explores blockchain and distributed ledger technologies
  • JPEG 360 Metadata
  • JPEG Pleno
  • JPEG Reference Software
  • JPEG 25th anniversary of the first JPEG standard

The following summarizes various activities during JPEG’s Rio de Janeiro meeting.

JPEG explores blockchain and distributed ledger technologies

During the 78th JPEG meeting in Rio de Janeiro, the JPEG committee organized a special session on blockchain and distributed ledger technologies and their impact on JPEG standards. As a result, the committee decided to explore use cases and standardization needs related to blockchain technology in a multimedia context. Use cases will be explored in relation to the recently launched JPEG Privacy and Security, as well as in the broader landscape of imaging and multimedia applications. To that end, the committee created an ad hoc group with the aim of gathering input from experts to define these use cases and to explore possible needs and advantages of supporting a standardization effort focused on imaging and multimedia applications. To get involved in the discussion, interested parties can register for the ad hoc group’s mailing list. Instructions to join the list are available on

JPEG 360 Metadata

The JPEG Committee notes the increasing use of images from multi-sensor devices, such as 360-degree capturing cameras or dual-camera smartphones available to consumers. Images from these cameras are shown on computers, smartphones, and Head Mounted Displays (HMDs). JPEG standards are commonly used for image compression and file formats. However, because existing JPEG standards do not fully cover these new uses, incompatibilities have reduced the interoperability of their images, thus reducing the widespread ubiquity that consumers have come to expect when using JPEG files. Additionally, new modalities for interacting with images, such as computer-based augmentation, face-tagging, and object classification, require support for metadata that was not part of the original scope of JPEG. A set of such JPEG 360 use cases is described in the JPEG 360 Metadata Use Cases document.

To avoid fragmentation in the market and to ensure wide interoperability, a standard way of interacting with multi-sensor images with richer metadata is desired in JPEG standards. JPEG invites all interested parties, including manufacturers, vendors and users of such devices to submit technology proposals for enabling interactions with multi-sensor images and metadata that fulfill the scope, objectives and requirements that are outlined in the final Call for Proposals, available on the JPEG website.

To stay posted on JPEG 360, please regularly consult our website and/or subscribe to the JPEG 360 e-mail reflector.


JPEG XL

The Next-Generation Image Compression activity (JPEG XL) has produced a revised draft Call for Proposals, and intends to publish a final Call for Proposals (CfP) following its 79th meeting (April 2018), with the objective of seeking technologies that fulfil the objectives and scope of Next-Generation Image Compression. During the 78th meeting, objective and subjective quality assessment methodologies for anchor and proposal evaluations were discussed and analyzed. As an outcome of the meeting, source code for objective quality assessment has been made available.

The draft Call for Proposals, with all related information, can be found on the JPEG website. Comments are welcome and should be submitted as specified in the document. To stay posted on the action plan for JPEG XL, please regularly consult our website and/or subscribe to our e-mail reflector.



JPEG XS

Since the previous 77th meeting, subjective quality evaluations have shown that the initial quality requirement of the JPEG XS Core Coding System has been met, i.e., visually lossless quality at a compression ratio of 6:1 for the large majority of images under test. Several profiles are now under development in JPEG XS, as well as transport and container formats. The JPEG Committee therefore invites interested parties – in particular coding experts, codec providers, system integrators and potential users of the foreseen solutions – to contribute to furthering the specifications in these directions. Publication of the International Standard is expected in Q3 2018.
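As a rough illustration of what a 6:1 visually lossless ratio implies for bandwidth, consider the following back-of-the-envelope sketch. The video parameters (4K resolution, 60 fps, 8-bit RGB) are assumptions chosen for illustration, not values taken from the JPEG XS specification:

```python
# Back-of-the-envelope: what a 6:1 visually lossless compression ratio
# means in practice. All video parameters below are illustrative
# assumptions, not requirements from the JPEG XS specification.
width, height, fps, bits_per_pixel = 3840, 2160, 60, 24

raw_bps = width * height * fps * bits_per_pixel   # ~11.94 Gbit/s uncompressed
compressed_bps = raw_bps / 6                      # ~1.99 Gbit/s at 6:1

print(round(raw_bps / 1e9, 2), round(compressed_bps / 1e9, 2))  # 11.94 1.99
```

Such lightweight mezzanine compression is what makes it plausible to carry high-resolution video over existing network and interconnect infrastructure.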

JPEG Pleno

The JPEG Pleno activity is currently divided into Pleno Light Field, Pleno Point Cloud and Pleno Holography. JPEG Pleno Light Field has been preparing a third round of core experiments for assessing the impact of individual coding modules on the overall rate-distortion performance. Moreover, it was decided to continue collecting additional test data and to progress with the preparation of working documents for JPEG Pleno specifications Part 1 and Part 2.

Furthermore, quality modelling studies are under consideration for both JPEG Pleno Point Cloud and JPEG Pleno Holography. In particular, JPEG Pleno Point Cloud is considering a set of new quality metrics provided as contributions to this work item. The new metrics are expected to replace the current state of the art, as they have shown superior correlation with subjective quality as perceived by humans. Moreover, new subjective assessment models have been tested and analysed to better understand the perception of quality for such new types of visual information.

JPEG Reference Software

The JPEG Committee is pleased to announce that its first image coding specification is now augmented by a new part, ISO/IEC 10918-7, which contains reference software. The proposed candidate software implementations have been checked for compliance with ISO/IEC 10918-2. Considering the positive results, this new part of the JPEG standard is expected to evolve quickly.


JPEG meeting room window view during a break.

JPEG 25th anniversary of the first JPEG standard

The third and final celebration of the 25th anniversary of the first JPEG standard is planned for the next, 79th JPEG meeting, taking place in La Jolla, CA, USA. The anniversary will be marked by a two-hour workshop on Friday 13 April on current and emerging JPEG technologies, followed by a social event where past JPEG committee members with relevant contributions will be awarded.

Final Quote

“Blockchain and distributed ledger technologies promise a significant impact on the future of many fields. JPEG is committed to providing standard mechanisms to apply blockchain to multimedia applications in general and to imaging in particular,” said Prof. Touradj Ebrahimi, the Convenor of the JPEG Committee.


About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JBIG, JPEG, JPEG 2000, JPEG XR, JPSearch and more recently, the JPEG XT, JPEG XS, JPEG Systems and JPEG Pleno families of imaging standards.

The JPEG Committee meets nominally four times a year, in different world locations. The 77th meeting was held from 21 to 27 October 2017 in Macau, China, and the latest, 78th meeting in Rio de Janeiro, Brazil. The next, 79th JPEG meeting will be held on 9-15 April 2018 in La Jolla, California, USA.

More information about JPEG and its work is available on the JPEG website or by contacting Antonio Pinheiro or Frederik Temmermans of the JPEG Communication Subgroup.

If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list.

Future JPEG meetings are planned as follows:

  • No 79, La Jolla (San Diego), CA, USA, April 9 to 15, 2018
  • No 80, Berlin, Germany, July 7 to 13, 2018
  • No 81, Vancouver, Canada, October 13 to 19, 2018



MPEG Column: 121st MPEG Meeting in Gwangju, Korea

The original blog post can be found at the Bitmovin Techblog and has been updated here to focus on and highlight research aspects.

The MPEG press release comprises the following topics:

  • Compact Descriptors for Video Analysis (CDVA) reaches Committee Draft level
  • MPEG-G standards reach Committee Draft for metadata and APIs
  • MPEG issues Calls for Visual Test Material for Immersive Applications
  • Internet of Media Things (IoMT) reaches Committee Draft level
  • MPEG finalizes its Media Orchestration (MORE) standard

At the end I will also briefly summarize what else happened with respect to DASH, CMAF, OMAF as well as discuss future aspects of MPEG.

Compact Descriptors for Video Analysis (CDVA) reaches Committee Draft level

The Committee Draft (CD) for CDVA has been approved at the 121st MPEG meeting, which is the first formal step of the ISO/IEC approval process for a new standard. This will become a new part of MPEG-7 to support video search and retrieval applications (ISO/IEC 15938-15).

Managing and organizing the quickly increasing volume of video content is a challenge for many industry sectors, such as media and entertainment or surveillance. One example task is scalable instance search, i.e., finding content containing a specific object instance or location in a very large video database. This requires video descriptors which can be efficiently extracted, stored, and matched. Standardization enables extracting interoperable descriptors on different devices and using software from different providers, so that only the compact descriptors instead of the much larger source videos can be exchanged for matching or querying. The CDVA standard specifies descriptors that fulfil these needs and includes (i) the components of the CDVA descriptor, (ii) its bitstream representation and (iii) the extraction process. The final standard is expected to be finished in early 2019.

CDVA introduces a new descriptor based on features which are output from a Deep Neural Network (DNN). CDVA is robust against viewpoint changes and moderate transformations of the video (e.g., re-encoding, overlays), and it supports partial matching and temporal localization of the matching content. The CDVA descriptor has a typical size of 2–4 KBytes per second of video. For typical test cases, it has been demonstrated to reach a correct matching rate of 88% (at a 1% false matching rate).
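To make the instance-search idea concrete, here is a minimal sketch of matching compact descriptors with cosine similarity. This is an illustrative stand-in only: the actual CDVA descriptor components, bitstream and extraction process are defined in ISO/IEC 15938-15, and the per-segment feature vectors below are hypothetical.

```python
import numpy as np

# Illustrative stand-in for descriptor matching: each video is represented
# by hypothetical per-segment DNN feature vectors (rows), and two videos
# match if any pair of segments has a high enough cosine similarity.

def match_score(query_desc: np.ndarray, ref_desc: np.ndarray) -> float:
    """Best cosine similarity over all query/reference segment pairs."""
    q = query_desc / np.linalg.norm(query_desc, axis=1, keepdims=True)
    r = ref_desc / np.linalg.norm(ref_desc, axis=1, keepdims=True)
    return float((q @ r.T).max())

def search(query: np.ndarray, database: dict, threshold: float = 0.8) -> list:
    """Return names of database videos whose best match exceeds threshold."""
    return [name for name, desc in database.items()
            if match_score(query, desc) >= threshold]
```

The key property the standard targets is visible even in this toy version: only the compact descriptor matrices need to be exchanged and compared, never the much larger source videos.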

Research aspects: There are probably endless research aspects in the visual descriptor space, ranging from validation of the results achieved so far to further improving the descriptors' discriminative power with the goal of increasing the correct matching rate (and consequently decreasing the false matching rate). In general, however, the question is whether there is still a need for descriptors in an era of over-provisioned bandwidth, storage and computing and the rising usage of artificial intelligence techniques such as machine learning and deep learning.

MPEG-G standards reach Committee Draft for metadata and APIs

In my previous report I introduced the MPEG-G standard for compression and transport technologies of genomic data. At the 121st MPEG meeting, metadata and APIs reached CD level. The former – metadata – provides relevant information associated with genomic data, and the latter – APIs – allow for building interoperable applications capable of manipulating MPEG-G files. Additional standardization plans for MPEG-G include the CDs for reference software (ISO/IEC 23092-4) and conformance (ISO/IEC 23092-5), which are planned to be issued at the next, 122nd MPEG meeting with the objective of producing Draft International Standards (DIS) at the end of 2018.

Research aspects: Metadata typically enables certain functionality which can be tested and evaluated against requirements. APIs allow building applications and services on top of the underlying functions, which could be a driver for research projects to make use of such APIs.

MPEG issues Calls for Visual Test Material for Immersive Applications

I have reported about the Omnidirectional Media Format (OMAF) in my previous report. At the 121st MPEG meeting, MPEG was working on extending OMAF functionalities to allow the modification of viewing positions, e.g., in case of head movements when using a head-mounted display, or for use with other forms of interactive navigation. Unlike OMAF which only provides 3 degrees of freedom (3DoF) for the user to view the content from a perspective looking outwards from the original camera position, the anticipated extension will also support motion parallax within some limited range which is referred to as 3DoF+. In the future with further enhanced technologies, a full 6 degrees of freedom (6DoF) will be achieved with changes of viewing position over a much larger range. To develop technology in these domains, MPEG has issued two Calls for Test Material in the areas of 3DoF+ and 6DoF, asking owners of image and video material to provide such content for use in developing and testing candidate technologies for standardization. Details about these calls can be found at

Research aspects: The good thing about test material is that it allows for reproducibility, which is an important aspect within the research community. Thus, it is more than appreciated that MPEG issues such a call, and let's hope that this material will become publicly available. Typically this kind of visual test material targets coding, but it would also be interesting to have such test content for storage and delivery.

Internet of Media Things (IoMT) reaches Committee Draft level

The goal of IoMT is to facilitate the large-scale deployment of distributed media systems with interoperable audio/visual data and metadata exchange. This standard specifies APIs providing media things (i.e., cameras/displays and microphones/loudspeakers, possibly capable of significant processing power) with the capability of being discovered, setting up ad hoc communication protocols, exposing usage conditions, and providing media and metadata as well as services processing them. IoMT APIs encompass a large variety of devices, not just connected cameras and displays but also sophisticated devices such as smart glasses, image/speech analyzers and gesture recognizers. IoMT enables the expression of the economic value of resources (media and metadata) and of associated processing in terms of digital tokens leveraged by the use of blockchain technologies.
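To illustrate the kind of discovery functionality such APIs target, here is a hypothetical sketch. All class and method names below are invented for illustration; they do not reflect the actual IoMT (ISO/IEC 23093) API.

```python
# Hypothetical sketch of media-thing discovery; names and signatures are
# invented for illustration and are NOT the actual IoMT standard API.
from dataclasses import dataclass, field

@dataclass
class MediaThing:
    name: str
    kind: str                                  # e.g. "camera", "microphone"
    capabilities: dict = field(default_factory=dict)

class Registry:
    """In-memory stand-in for the discovery service IoMT standardizes."""
    def __init__(self):
        self._things = []

    def register(self, thing: MediaThing) -> None:
        self._things.append(thing)

    def discover(self, kind: str) -> list:
        return [t for t in self._things if t.kind == kind]

reg = Registry()
reg.register(MediaThing("hall-cam", "camera", {"resolution": "1080p"}))
reg.register(MediaThing("lobby-mic", "microphone"))
print([t.name for t in reg.discover("camera")])  # ['hall-cam']
```

The value of standardizing such interfaces is that a display, analyzer or recognizer from one vendor could discover and consume media from another vendor's camera without bespoke integration.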

Research aspects: The main focus of IoMT is APIs, which provide easy and flexible access to the underlying devices' functionality and thus are an important factor in enabling research within this interesting domain. For example, using these APIs to enable communication among these various media things could bring up new forms of interaction with these technologies.

MPEG finalizes its Media Orchestration (MORE) standard

The MPEG Media Orchestration (MORE) standard reached Final Draft International Standard (FDIS), the final stage of development before being published by ISO/IEC. The scope of the Media Orchestration standard is as follows:

  • It supports the automated combination of multiple media sources (i.e., cameras, microphones) into a coherent multimedia experience.
  • It supports rendering multimedia experiences on multiple devices simultaneously, again giving a consistent and coherent experience.
  • It contains tools for orchestration in time (synchronization) and space.

MPEG expects the Media Orchestration standard to be especially useful in immersive media settings. This applies notably to social virtual reality (VR) applications, where people share a VR experience and are able to communicate about it. Media Orchestration is expected to allow synchronizing the media experience for all users and to give them a spatially consistent experience, which is important for a social VR user to be able to understand when other users are looking at them.

Research aspects: This standard enables the social multimedia experience proposed in the literature. Interestingly, the W3C is working on something similar referred to as the timing object, and it would be interesting to see whether these approaches have some commonalities.

What else happened at the MPEG meeting?

DASH is fully in maintenance mode and we are still waiting for the 3rd edition which is supposed to be a consolidation of existing corrigenda and amendments. Currently only minor extensions are proposed and conformance/reference software is being updated. Similar things can be said for CMAF where we have one amendment and one corrigendum under development. Additionally, MPEG is working on CMAF conformance. OMAF has reached FDIS at the last meeting and MPEG is working on reference software and conformance also. It is expected that in the future we will see additional standards and/or technical reports defining/describing how to use CMAF and OMAF in DASH.

Regarding the future video codec, the call for proposals has been out since the last meeting, as announced in my previous report, and responses are due for the next meeting. Thus, it is expected that the 122nd MPEG meeting will be the place to be in terms of MPEG's future video codec. Speaking about the future, shortly after the 121st MPEG meeting, Leonardo Chiariglione published a blog post entitled “a crisis, the causes and a solution”, which is related to HEVC licensing, the Alliance for Open Media (AOM), and possible future options. The blog post certainly caused some reactions within the video community at large, and I think this was also intended. Let's hope it will galvanize the video industry – not to push the button – but to start addressing and resolving the issues. As pointed out in one of my other blog posts about what to care about in 2018, the upcoming MPEG meeting in April 2018 is certainly a place to be. Additionally, that post highlights some conferences related to various aspects also discussed in MPEG, which I'd like to republish here:

  • QoMEX — Int’l Conf. on Quality of Multimedia Experience — will be hosted in Sardinia, Italy from May 29-31, which is THE conference to be for QoE of multimedia applications and services. Submission deadline is January 15/22, 2018.
  • MMSys — Multimedia Systems Conf. — and specifically Packet Video, which will be on June 12 in Amsterdam, The Netherlands. Packet Video is THE adaptive streaming scientific event 2018. Submission deadline is March 1, 2018.
  • Additionally, you might be interested in ICME (July 23-27, 2018, San Diego, USA), ICIP (October 7-10, 2018, Athens, Greece; specifically in the context of video coding), and PCS (June 24-27, 2018, San Francisco, CA, USA; also in the context of video coding).
  • The DASH-IF academic track hosts special events at MMSys (Excellence in DASH Award) and ICME (DASH Grand Challenge).
  • MIPR — 1st Int’l Conf. on Multimedia Information Processing and Retrieval — will be in Miami, Florida, USA from April 10-12, 2018. It has a broad range of topics including networking for multimedia systems as well as systems and infrastructures.

Report from ACM Multimedia 2017 – by Benoit Huet


Best #SIGMM Social Media Reporter Award! Me? Really??

This was my reaction after being informed by the SIGMM Social Media Editors that I was one of the two recipients following ACM Multimedia 2017! #ACMMM What a wonderful idea this is to encourage our community to communicate, both internally and to other related communities, about our events, our key research results and all the wonderful things the multimedia community stands for!  I have always been surprised by how limited social media engagement is within the multimedia community. Your initiative has all my support! Let’s disseminate our research interest and activities on social media! @SIGMM #Motivated


The SIGMM flagship conference took place on October 23-27 at the Computer History Museum in Mountain View, California, USA. For its 25th edition, the organizing committee had prepared an attractive program cleverly mixing expected classics (e.g., the Best Paper session, Grand Challenges, the Open Source Software Competition) and brand new sessions (such as Fast Forward and Thematic Workshops, Business Idea Venture, and the Novel Topics Track). In this edition, the conference adopted a single paper length, removing the boundary between long and short papers. The TPC Co-Chairs and Area Chairs had the responsibility of directing accepted papers to either an oral session or a thematic workshop.

Thematic workshops took the form of poster presentations. Presenters were asked to provide a short video briefly motivating their work, with the intention of making the videos available online for reference after the conference (possibly with a link to the full paper and the poster!). However, this did not come through, as publication permissions were not cleared in time, but the idea is interesting and should be considered for future editions. Fast Forward (or thematic workshop pitch) sessions are short, targeted presentations aimed at attracting the audience to the thematic workshop where the papers are presented (in the form of posters in this case). While such short presentations allow conference attendees to efficiently identify which posters are relevant to them, it is crucial for presenters to be well prepared and to concentrate on highlighting one key research idea, as time is very limited. It also gives more exposure to posters. I would be in favor of keeping such sessions for future ACM Multimedia editions.

The 25th edition of ACM MM wasn't short of keynotes: no fewer than six industry keynotes punctuated the conference's half days. The first keynote, by Achin Bhowmik from Starkey, focused on audio as a means of “Enhancing and Augmenting Human Perception with Artificial Intelligence”. Bill Dally from NVidia presented “Efficient Methods and Hardware for Deep Learning” – in short, why we all need GPUs! “Building Multi-Modal Interfaces for Smartphones” was the topic presented by Injong Rhee (Samsung Electronics); Scott Silver (YouTube) discussed the difficulties in “Bringing a Billion Hours to Life” (referring to the vast quantities of videos uploaded and viewed on the sharing platform, and the long tail). Ed Chang from HTC presented “DeepQ: Advancing Healthcare Through AI and VR” and demonstrated how healthcare is and will be benefiting from AR, VR and AI. Danny Lange from Unity Technologies highlighted how important machine learning and deep learning are in the game industry in “Bringing Gaming, VR, and AR to Life with Deep Learning”. Personally, I would have preferred a mix of industry and academic keynotes, as I found some of the keynotes not targeted at an audience of computer scientists.

Arnold W. M. Smeulders received the SIGMM Technical Achievement Award for his outstanding and pioneering contribution to defining and bridging the semantic gap in content-based image retrieval. His talk was sharp, enlightening and very well received by the audience.

The @sigmm rising star award went to Dr Liangliang Cao for his contribution to large-scale multimedia recognition and social media mining.

The conference was noticeably flavored with trendy topics such as AI, Human augmenting technologies, Virtual and Augmented Reality, and Machine (Deep) Learning, as can be noticed from the various works rewarded.

The Best Paper award was given to Bokun Wang, Yang Yang, Xing Xu, Alan Hanjalic, Heng Tao Shen for their work on “Adversarial Cross-Modal Retrieval“.

Yuan Tian, Suraj Raghuraman, Thiru Annaswamy, Aleksander Borresen, Klara Nahrstedt, Balakrishnan Prabhakaran received the Best Student Paper award for the paper “H-TIME: Haptic-enabled Tele-Immersive Musculoskeletal Examination“.

The Best demo award went to “NexGenTV: Providing Real-Time Insight during Political Debates in a Second Screen Application” by Olfa Ben Ahmed, Gabriel Sargent, Florian Garnier, Benoit Huet, Vincent Claveau, Laurence Couturier, Raphaël Troncy, Guillaume Gravier, Philémon Bouzy and Fabrice Leménorel.

The Best Open source software award was received by Hao Dong, Akara Supratak, Luo Mai, Fangde Liu, Axel Oehmichen, Simiao Yu, Yike Guo for “TensorLayer: A Versatile Library for Efficient Deep Learning Development“.

The Best Grand Challenge Video Captioning Paper award went to “Knowing Yourself: Improving Video Caption via In-depth Recap“, by Qin Jin, Shizhe Chen, Jia Chen, Alexander Hauptmann.

The Best Grand Challenge Social Media Prediction Paper award went to Chih-Chung Hsu, Ying-Chin Lee, Ping-En Lu, Shian-Shin Lu, Hsiao-Ting Lai, Chihg-Chu Huang, Chun Wang, Yang-Jiun Lin, Weng-Tai Su for “Social Media Prediction Based on Residual Learning and Random Forest“.

Finally, the Best Brave New Idea Paper award was conferred to John R Smith, Dhiraj Joshi, Benoit Huet, Winston Hsu and Zef Cota for the paper “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation“.

A few years back, the multimedia community was concerned with the lack of truly multimedia publications. In my opinion, those days are behind us. The technical program has evolved into a richer and broader one, let’s keep the momentum!

The location was a wonderful opportunity for many of the attendees to take a stroll down memory lane and see computers and devices (VT100, PC, etc…) from the past, thanks to the complimentary entrance to the museum exhibitions. The “isolated” location of the conference venue meant going out for lunch was out of the question given the duration of the lunch break. As a solution, the organizers catered buffet lunches. This resulted in the majority of the attendees interacting and mixing over the lunch break while eating. This could be an effective way to better integrate new participants and strengthen the community. Both the welcome reception and the banquet were held successfully within the Computer History Museum. Both events offered yet another opportunity for new connections to be made and for further interaction between attendees. Indeed, the atmosphere of both occasions was relaxed, lively and joyful.

All in all, ACM MM 2017 was another successful edition of our flagship conference, many thanks to the entire organizing team and see you all in Seoul for ACM MM 2018 and follow @sigmm on Twitter!

Report from ACM Multimedia 2017 – by Conor Keighrey


My name is Conor Keighrey, and I am a PhD candidate at the Athlone Institute of Technology in Athlone, Co. Westmeath, Ireland. The focus of my research is to understand the key influencing factors that affect Quality of Experience (QoE) in emerging immersive multimedia experiences, with a specific focus on applications in the speech and language therapy domain. This research is funded by the Irish Research Council Government of Ireland Postgraduate Scholarship Programme. I'm delighted to have been asked to present this report to the SIGMM community as a result of my social media activity at the ACM Multimedia Conference.

Launched in 1993, the ACM Multimedia (ACMMM) Conference held its 25th anniversary event in Mountain View, California. The conference was located in the heart of Silicon Valley, at the inspirational Computer History Museum.

The conference called for papers under five focal themes: Experience, Systems and Applications, Understanding, Novel Topics, and Engagement.

Keynote addresses were delivered by high-profile, industry-leading experts from the field of multimedia. These talks provided insight into active developments from the following experts:

  • Achin Bhowmik (CTO & EVP, Starkey, USA)
  • Bill Dally (Senior Vice President and Chief Scientist, NVidia, USA)
  • Injong Rhee (CTO & EVP, Samsung Electronics, Korea)
  • Edward Y. Chang (President, HTC, Taiwan)
  • Scott Silver (Vice President, Google, USA)
  • Danny Lange (Vice President, Unity Technologies, USA)

Some keynote highlights include Bill Dally's talk on “Efficient Methods and Hardware for Deep Learning”. Bill provided insight into the work NVidia is doing with neural networks, the hardware which drives them, and the techniques the company is using to make them more efficient. He also highlighted that AI should be thought of not as a mechanism which replaces humans, but as one which empowers them, thus allowing us to explore more intellectual activities.

Danny Lange of Unity Technologies discussed the application of the Unity game engine to create scenarios in which machine learning models can be trained. His presentation, entitled “Bringing Gaming, VR, and AR to Life with Deep Learning”, described the capture of data for self-driving cars to prepare for unexpected occurrences in the real world (e.g., pedestrian activity or other cars behaving in unpredictable ways).

A number of the Keynotes were captured by FXPAL (an ACMMM Platinum Sponsor) and are available here.

With an acceptance rate of 27.63% (684 reviewed, 189 accepted), the main track at ACMMM showcased a diverse collection of research from academic institutes around the globe. An abundance of work was presented in the ever-expanding area of deep/machine learning, virtual/augmented/mixed realities, and the traditional multimedia field.


The importance of gender equality and diversity with respect to advancing the careers of women in STEM has never been greater. Sponsored by SIGMM, the Women/Diversity in MM lunch took place on the first day of ACMMM. Speakers such as Prof. Noel O'Connor discussed the significance of initiatives such as Athena SWAN (Scientific Women's Academic Network) within Dublin City University (DCU). Katherine Breeden (pictured left), an Assistant Professor in the Department of Computer Science at Harvey Mudd College (HMC), presented a fantastic talk on gender balance at HMC. Katherine's discussion highlighted the key changes that have resulted in more women than men graduating with a degree in computer science at the college.

Other highlights from day 1 include a paper presented at the Experience 2 (Perceptual, Affect, and Interaction) session, chaired by Susanne Boll (University of Oldenburg). Researchers from the National University of Singapore presented the results of a multisensory virtual cocktail (Vocktail) experience which was well received. 


Through the stimulation of three sensory modalities, Vocktails aim to create virtual flavors and augment taste experiences through a customizable, interactive drinking utensil. Controlled by a mobile device, participants of the study experienced augmented taste (electrical stimulation of the tongue), smell (micro air pumps), and visual (RGB light projected onto the liquid) stimuli as they used the system. For more information, check out their paper entitled “Vocktail: A Virtual Cocktail for Pairing Digital Taste, Smell, and Color Sensations” on the ACM Digital Library.

Day 3 of the conference included a session entitled Brave New Ideas. The session presented a fantastic variety of work focused on the use of multimedia technologies to enhance or create intelligent systems. Demonstrating AI as an assistive tool and winning the Best Brave New Idea Paper award, a paper entitled “Harnessing A.I. for Augmenting Creativity: Application to Movie Trailer Creation” (ACM Digital Library) describes the first-ever human-machine collaboration for creating a real movie trailer. Through multi-modal semantic extraction, inclusive of audio-visual and scene analysis and a statistical approach, key moments which characterize horror films were defined. As a result, the AI selected 10 scenes from a feature-length film, which were further developed alongside a professional filmmaker to finalize an exciting movie trailer. Officially released by 20th Century Fox, the complete AI trailer for the horror movie “Morgan” can be viewed here.

A new addition to this ACMMM edition was the inclusion of thematic workshops. Four individual workshops (as outlined below) provided an opportunity for papers which could not be accommodated within the main track to be presented to the multimedia research community. A total of 495 papers were reviewed, of which 64 were accepted (12.93%). Authors of accepted papers presented their work via on-stage thematic workshop pitches, which were followed by poster presentations on Monday the 23rd and Friday the 27th. The workshop themes were as follows:

  • Experience (Organised by Wanmin Wu)
  • Systems and Applications (Organised by Roger Zimmermann & He Ma)
  • Engagement (Organised by Jianchao Yang)
  • Understanding (Organised by Qi Tian)

Presented as part of the thematic workshop pitches, one of the most fascinating demos at the conference was a body of work carried out by Audrey Ziwei Hu (University of Toronto). Her paper entitled “Liquid Jets as Logic-Computing Fluid-User-Interfaces” describes a fluid (water) user interface which is presented as a logic-computing device. Water jets form a medium for tactile interaction and control to create a musical instrument known as a hydraulophone.

Steve Mann (pictured left) from Stanford University, who is regarded as “The Father of Wearable Computing”, provided a fantastic live demonstration of the device. The full paper can be found in the ACM Digital Library, and a live demo can be seen here.

At large-scale events such as ACMMM, the importance of social media reporting and interaction has never been greater. More than 250 social media interactions (tweets, retweets, and likes) were monitored using the #SIGMM and #ACMMM hashtags, as outlined by the SIGMM Records prior to the event. Descriptive (and multimedia-enhanced) social media reports provide those who encounter an unavoidable schedule overlap with an opportunity to gather some insight into the alternative works presented at the conference.

From my own perspective (as a PhD student), the most important aspect of social media interaction is that reports often serve as a conversation piece. Developing a social presence throughout the many coffee breaks and social events during the conference is key to building a network of contacts within any community. For a newcomer this can often be a daunting task; recognizing other social media reporters offers the perfect ice-breaker, providing an opportunity to discuss and inform each other of the ongoing work within the multimedia community. As a result of my own online reporting, I was recognized numerous times throughout the conference. Staying active on social media often leads to the development of a research audience and a social media presence among peers. Engaging with such an audience is key to the success of those who wish to follow a path in academia and research.

Building on my own personal experience, continued attendance at SIGMM conferences (irrespective of paper submission) has many advantages. While the predominant role of a conference is to disseminate work, the informative aspect of attending such events is often overlooked. The area of multimedia research is moving at a fast pace, so having the opportunity to engage directly with researchers in your field of expertise is of utmost importance. Attending ACMMM and other SIGMM conferences, such as ACM Multimedia Systems, has inspired me to explore alternative methodologies within my own research. Without a doubt, continued attendance will inspire my research as I move forward.

ACM Multimedia ‘18 (October 22nd – 26th) – Seoul, South Korea, with its diverse landscape of modern skyscrapers mixed with traditional Buddhist temples and palaces, will host the 26th annual ACMMM. The 2018 event will without a doubt present a variety of work from the multimedia research community. Regular paper abstracts are due on the 30th of March (full manuscripts are due on the 8th of April). For more information on next year’s ACM Multimedia conference, check out the following link:

An interview with Miriam Redi

Miriam nowadays.

Miriam at the beginning of her research career.


Describe your journey into computing from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

I literally grew up with computers all around me. I was born in a little town raised around the headquarters of Olivetti, one of the biggest tech companies of the last century: becoming a computer geek, in that place, at that time, was easier than usual! I have always been fascinated by the power of visuals and music to convey ideas. I loved to learn about history and the world through songs and movies. How to merge my love for computers with my passion for the audiovisual arts? I enrolled in Media Engineering studies, where, aside from traditional Computer Engineering knowledge, I had the chance to learn more about media history and design. The main message? Multidisciplinarity is key. We cannot design intelligent multimedia technologies without deeply understanding how a medium is created, perceived and distributed.

Talking about multidisciplinarity, what do you think is the current state of multidisciplinarity in the multimedia community?

My impression is that, due to the inherent multimodality of our research, our community has developed a natural ability to blend techniques and theories from various domains. I believe we can push the boundaries of this multidisciplinarity even further. I am thinking, for example, of the MM subcommunity interested in mining subjective attributes from data, such as mood, sentiment, or beauty. I believe such research works could benefit immensely from a collaboration between MM scientists and domain experts in psychology, cognitive science, visual perception, or visual arts.

Tell us more about the vision and objectives behind your current roles. What do you hope to accomplish, and how will you bring this about?

My dream is to make multimedia science even more useful for society and for collective growth. Multimedia data allows us to easily absorb and communicate knowledge, without language barriers. Producing and generating audiovisual content has never been easier: today, the potential of multimedia for learning and sharing human knowledge is unprecedented! Intelligent multimedia systems could be put in place to support editor communities in making free online encyclopedias like Wikipedia or collaborative knowledge bases like Wikidata more “visual” – and therefore less tied to individual languages. By doing so, we could increase the possibility for people around the world to freely access the sum of all knowledge.

I like your approach about making something useful for society. What do you think about the criticism that multimedia research is too applied?

For me, high-quality research means creative research, where ‘creative’ means ‘new and valuable’. The coexistence of breadth and depth in Multimedia allows us to create novel and useful applied research works, making them, to me, as interesting and inspiring as more theoretical research works.

Can you profile your current research, its challenges, opportunities, and implications?

I work on responsible multimedia algorithms. I love building machines that can classify audiovisual and textual data according to subjective properties – for example, the informativeness of an image with respect to a topic, its epistemic value, the beauty of a photo, the creative degree of a video. Given the inherently subjective nature of these algorithms, one of the main challenges of my research is to make such models responsible, namely:
1) Diversity-Aware i.e. reflecting the real subjective perception of people with different cultural backgrounds; this is key to empower specific cultures, designing AI to grow diversified content and fill the knowledge gaps in online knowledge repositories.
2) Interpretable and Unbiased, namely not only able to classify content, but also able to say why the content was classified in a certain way (so that we can detect algorithmic bias). Such powerful algorithms can be used to study the visual preferences of users of web and social media platforms, and retrieve interesting content accordingly.

Do you think that one day we will have algorithms that truly understand the human perception of beauty and art? Or will it always depend on the data?

Philosophers have been trying for centuries to understand the true nature of aesthetic perception. In general, I do not believe in absolute truths. And I am not really confident that algorithms will be able to become great philosophers anytime soon.

How would you describe the role of women especially in the field of multimedia?

The role of women in multimedia is the role of any researcher in their scientific community: contribute to scientific development, push the boundaries of what is known, doubt the widely accepted notions, make this world a better place (no pressure!). Maintaining diversity (any kind of diversity – including gender, expertise, race, age) in the scientific discourse is crucial: as opposed to a single mono-culture, a diverse community gathers, elaborates and combines different perspectives, thus forcing a collective creative process of exchange and growth, which is essential to scientific development.

Do you think that female researchers are well represented in the multimedia community? For example, there was no female keynote speaker at ACM MM 2017.

I am not sure about the numbers, so I can’t say for sure what the percentage of women and non-binary gender persons in the multimedia community is. But I am sure that percentage is greater than 0. When filling positions of high visibility such as keynotes or committee members, we should always keep in mind that one of our tasks is to inspire younger generations. Generations of young, brilliant, beautifully diverse researchers.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact it has today and in the future?

Since my early days in multimedia, when we were retrieving video shots of airplanes, until today, when we classify creative videos or interesting pictures, I would say that the main contribution of my research has been to “break the boundaries”.
We broke the scientific field boundaries. We designed multimedia algorithms inspired by the visual arts and psychology; we collaborated with experts from philosophy, media history, sociology; and we could deliver creative, interdisciplinary research works which would contribute to the advancement of multimedia and all the fields involved.

We broke the social network boundaries, with models able to quantify the intrinsic quality of images in a photo sharing platform. Furthermore, we showed that popularity-driven mechanisms, typical of social networks, fail to promote high-quality content, and that only content-based quality assessment tools could restore meritocracy in online media platforms.

We broke the cultural boundaries: together with an amazing multi-cultural research team, we were able to design computer vision models that can adapt to different cultures and language communities. While the effectiveness of our approaches and the scientific growth are per se a main achievement, the publications resulting from this collaborative effort reached the top-level Computer Vision, Multimedia and Social Media conferences (with a best paper award at ICWSM and a best multimodal paper award at ICMR), and our work was featured in a number of tech journals and in a TEDx presentation. Together with other scientists, we also started a number of initiatives to gather people from different communities who are interested in this area: a special session at ICMR 2017, a workshop at MM 2017, one at CVPR 2018, and a special issue of ACM TOMM.

What are in your opinion the future topics in multimedia? Where is the community strong, and where could it improve or increase focus?

My feeling is that we should re-discover and empower the ‘multi-’ness of our research field.
I think the beauty of multimedia research is the ability to tell compelling multimodal stories from signals of very diverse nature, with a focus on the positive experience of the user. We are able to process multiple sources of information and use them, for example, to generate multi-sensorial artistic compositions, expose interesting findings about users and their behavior in multiple modalities, or provide tools to explore and align multimodal information, allowing easier knowledge absorption. We should not forget the diversity of modalities we are able to process (e.g. music or social signals, or traditional image data), the types of attributes we can draw from these modalities (e.g. sentiment or appeal, or more binary semantic labels), and the variety of application scenarios we can imagine for our research works (e.g. arts, photography, cooking, or more consolidated use cases, such as image search or retrieval). And we should encourage emerging topics and applications towards these ‘multi-nesses’.
Beyond multidisciplinarity and multiple modalities, I would also hope to see more multi-cultural research works: given the beautifully diverse world we are part of, I believe multimedia research works and applications should model and take into account the multiple points of views, diverse perceptual responses, as well as the cultural and language differences of users around the world.

Miriam nowadays.


Over your distinguished career, what are your top lessons you want to share with the audience?

I am not sure if this is a real lesson; it is more something I deeply believe in. Stereotypes kill ideas. Stereotyping others (colleagues, friends) can make communication, brainstorming, or collective problem solving much harder, because it influences the importance given to other people’s ideas. Also, stereotyping oneself and one’s limits can constrain the possibilities and narrow one’s view of the shapes of possible future paths.

How was it to have a sister working in the same field of research? Is it motivation or pressure? Did you collaborate on some topics?

In one word: inspiring. We never officially collaborated on any research work. Unofficially, we’ve been ‘collaborating’ for 32 years :) (Interview with Judith Redi)

Diversity and Credibility for Social Images and Image Retrieval

Social media has established itself as an inextricable component of today’s society. Images make up a large proportion of the items shared on social media [1]. The popularity of social image sharing has contributed to the popularity of the Retrieving Diverse Social Images task at the MediaEval Benchmarking Initiative for Multimedia Evaluation [2]. Since its introduction in 2013, the task has attracted broad participation and has published a set of datasets of outstanding value to the multimedia research community.

The task, and the datasets it has released, target a novel facet of multimedia retrieval, namely the search result diversification of social images. The task is defined as follows: Given a large number of images, retrieved by a social media image search engine, find those that are not only relevant to the query, but also provide a diverse view of the topic/topics behind the query (see an example in Figure 1). The features and methods needed to address the task successfully are complex and span different research areas (image processing, text processing, machine learning). For this reason, when creating the collections used in the Retrieving Diverse Social Images Tasks, we also created a set of baseline features. The features are released with the datasets. In this way, task participants who have expertise in one particular research area may focus on that area and still participate in the full evaluation.
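The interplay between relevance and diversity can be illustrated with a simple greedy re-ranking sketch in the style of Maximal Marginal Relevance. Note this is only an illustrative baseline, not one of the participants’ methods; the `relevance` scores and `similarity` function stand in for whatever text or visual descriptors a real system would use.

```python
def diversify(candidates, relevance, similarity, k=20, lmbda=0.7):
    """Greedy MMR-style re-ranking: at each step, pick the candidate that
    best trades off relevance against similarity to the already-selected
    images. `relevance` maps ids to scores; `similarity` is a pairwise
    function returning values in [0, 1]."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(c):
            redundancy = max((similarity(c, s) for s in selected), default=0.0)
            return lmbda * relevance[c] - (1 - lmbda) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With `lmbda` close to 1 the ranking reduces to pure relevance ordering; lowering it increasingly penalizes near-duplicate results.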

Figure 1: Example of retrieval and diversification results for query “Pingxi Sky Lantern Festival” (results are truncated to the first 14 images for better visualization): (top images) Flickr initial retrieval results; (bottom images) diversification achieved with the approach from the TUW team (best approach at MediaEval 2015).


The collections

Before describing the individual collections, it should be noted that all data consist of redistributable Creative Commons Flickr and Wikipedia content and are freely available for download (follow the instructions here [3]). Although the task also ran in 2017, in the following we focus on the datasets already released, namely: Div400, Div150Cred, Div150Multi and Div150Adhoc (corresponding to the 2013-2016 evaluation campaigns). Each of the four datasets available so far covers different aspects of the diversification challenge, either from the perspective of the task/use-case addressed, or from the data that can be used to address the task. Table 1 gives an overview of the four datasets, which we describe in more detail in the next four subsections. Each of the datasets is divided into a development set and a test set. Although the division into development and test data is arbitrary, for comparability of results and full reproducibility, users of the collections are advised to maintain this separation when performing their experiments.

Table 1: Dataset statistics (devset – development data, testset – testing data, credibilityset – data for estimating user tagging credibility, single (s) – single topic queries, multi (m) – multi-topic queries, ++ – enhanced/updated content, POI – location point of interest, events – events and states associated with locations, general – general purpose ad-hoc topics).


In 2013, the task started with a narrowly defined use-case scenario, where a tourist, upon deciding to visit a particular location, reads the corresponding Wikipedia page and desires to see a diverse set of images from that location. Queries here might be “Big Ben in London” or “Palazzo delle Albere in Italy”. For each such query, we know the GPS coordinates, the name, and the Wikipedia page, including an example image of the destination. As a search pool, we consider the top 150 photos obtained from Flickr using the name as a search query. These photos come with some metadata (photo ID, title, description, tags, geotagging information, date when the photo was taken, owner’s name, number of times the photo has been displayed, URL in Flickr, license type, number of comments on the photo) [4].

In addition to providing the raw data, the collection also contains visual and text features of the data, so that researchers who are interested in only one of the two can use the other without investing additional time in generating a baseline set of features.

As visual descriptors, for each of the images in the collection, we provide:

  • Global color naming histogram
  • Global histogram of oriented gradients
  • Global color moments on HSV
  • Global Local Binary Patterns on gray scale
  • Global Color Structure Descriptor
  • Global statistics on gray level Run Length Matrix (Short Run Emphasis, Long Run Emphasis, Gray-Level Non-uniformity, Run Length Non-uniformity, Run Percentage, Low Gray-Level Run Emphasis, High Gray-Level Run Emphasis, Short Run Low Gray-Level Emphasis, Short Run High Gray-Level Emphasis, Long Run Low Gray-Level Emphasis, Long Run High Gray-Level Emphasis)
  • Local spatial pyramid representations (3×3) of each of the previous descriptors

As textual descriptors we provide the classic Term Frequency (TF(t,d), the number of occurrences of term t in document d) and Document Frequency (DF(t), the number of documents containing term t). Note that the datasets are not limited to a single notion of document. The most direct definition of a “document” is an image that can be either retrieved or not retrieved. However, it is easily conceivable that the relative frequency of a term in the set of images corresponding to one topic, or the set of images corresponding to one user, might also be of interest in ranking the importance of a result to a query. Therefore, the collection also contains statistics that take a document to be a topic, as well as a user. All these are provided both as CSV files and as Lucene index files. The former can be used as part of a custom weighting scheme, while the latter can be deployed directly in a Lucene/Solr search engine to obtain text-based results without further effort.
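As a minimal sketch, the two statistics can be computed in a few lines of Python; here each “document” is just a mapping from an identifier (an image, a topic, or a user) to its list of terms, a simplified stand-in for the released CSV and Lucene files.

```python
from collections import Counter

def term_stats(documents):
    """documents: {doc_id: list of terms}. Returns per-document term
    frequencies TF(t, d) and corpus-wide document frequencies DF(t)."""
    # TF(t, d): raw occurrence counts of each term within each document.
    tf = {d: Counter(terms) for d, terms in documents.items()}
    # DF(t): number of documents containing the term at least once.
    df = Counter()
    for terms in documents.values():
        df.update(set(terms))
    return tf, df
```

Switching the notion of “document” from image to topic or user only changes how the input dictionary is built; the computation stays the same.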


The tourism use case also underlies Div150Cred, but a component addressing the concept of user tagging credibility is added. The idea here is that not all users tag their photos in a manner that is useful for retrieval and, for this reason, it makes sense to consider, in addition to the visual and text descriptors also used in Div400, another feature set – a user credibility feature. Each of the 153 topics (30 in the development set and 123 in the test set) therefore comes, in addition to the visual and text features of each image, with a value indicating the credibility of the user. This value is estimated automatically based on a set of features, so in addition to the retrieval development and test sets, Div150Cred also contains a credibility set, which we used to generate the credibility of each user, and which can be used by any interested researcher to build better credibility estimators.

The credibility set contains images for approximately 300 locations from 685 users (a total of 3.6 million images). For each user there is a manually assigned credibility score as well as an automatically estimated one, based on the following features:

  • Visual score – learned predictor of a user’s consistent and relevant tagging behavior
  • Face proportion
  • Tag specificity
  • Location similarity
  • Photo count
  • Unique tags
  • Upload frequency
  • Bulk proportion

For each of these, the intuition behind it and the actual calculation is detailed in the collection report [5].
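As a toy illustration only (the actual estimator is learned, as described in the collection report [5]), a hypothetical linear combiner over such per-user features might look like the following; both the feature names and the weights here are made up for the example.

```python
def credibility_score(features, weights):
    """Hypothetical weighted average of user-credibility features.
    `features` maps feature names to values normalized into [0, 1];
    `weights` maps feature names to non-negative importance weights.
    This is NOT the estimator used for the dataset, just a sketch."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    return sum(weights[name] * features.get(name, 0.0)
               for name in weights) / total
```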


Div150Multi adds another twist to the task of the search engine and its tourism use-case. Now, the topics are not simply points of interest, but rather a combination of a main concept and a qualifier, namely multi-topic queries about location specific events, location aspects or general activities (e.g., “Oktoberfest in Munich”, “Bucharest in winter”). In terms of features however, the collection builds on the existing ones used in Div400 and Div150Cred, but adds to the pool of resources the researchers have at their disposal. In terms of credibility, in addition to the 8 features listed above, we now also have:

  • Mean Photo Views
  • Mean Title Word Counts
  • Mean Tags per Photo
  • Mean Image Tag Clarity

Again, for details on the intuition and formulas behind these, the collection report [6] is the reference material.

A new set of descriptors, based on convolutional neural networks, has now been made available.

  • CNN generic: a descriptor based on the reference convolutional neural network (CNN) model provided with the Caffe framework [7]. This model is trained on the 1,000 ImageNet classes used during the ImageNet challenge. The descriptors are extracted from the last fully connected layer of the network (named fc7).
  • CNN adapted: These features were also computed using the Caffe framework, with the reference model architecture but using images of 1,000 landmarks instead of ImageNet classes. We collected approximately 1,200 Web images for each landmark and fed them directly to Caffe for training [8]. Similar to CNN generic, the descriptors were extracted from the last fully connected layer of the network (i.e., fc7).


For this dataset, the definition of relevance was expanded from previous years, with the introduction of even more challenging multi-topic queries unrelated to POIs. These queries address the diversification problem for a general ad-hoc image retrieval system, where general-purpose multi-topic queries are used for retrieving the images (e.g., “animals at Zoo”, “flying planes on blue sky”, “hotel corridor”). The Div150Adhoc collection includes most of the previously described credibility descriptors, but drops faceProportion and locationSimilarity, as they were no longer relevant for the new retrieval scenario. Also, the visualScore descriptor was updated to keep up with the latest advancements in CNN descriptors. Consequently, when training individual visual models, the Overfeat visual descriptor is replaced by the representation produced by the last fully connected layer of the network [9]. Full details are available in the collection report [10].

Ground-truth and state-of-the-art

Each of the above collections comes with an associated ground-truth, created by human assessors. As the focus is on both relevance and diversity, the ground truth and the metrics used reflect it: Precision at cutoff (primarily P@20) is used for relevance, and Cluster Recall at cutoff (primarily CR@20) is used for diversity.
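In sketch form, assuming each topic’s ground truth is given as a set of relevant image ids plus a mapping from relevant images to their visual clusters, the two metrics can be computed as follows:

```python
def precision_at(ranked, relevant, n=20):
    """P@n: fraction of the top-n ranked images that are relevant."""
    top = ranked[:n]
    return sum(1 for img in top if img in relevant) / n

def cluster_recall_at(ranked, clusters, n=20):
    """CR@n: fraction of ground-truth clusters represented in the top n.
    `clusters` maps relevant image ids to their cluster labels."""
    total = set(clusters.values())
    found = {clusters[img] for img in ranked[:n] if img in clusters}
    return len(found) / len(total) if total else 0.0
```

A ranking can thus score high on P@20 while scoring low on CR@20 if it keeps returning near-duplicate images from the same cluster, which is exactly the failure mode the task targets.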

Figure 2 shows an overview of the results obtained by participants in the evaluation campaigns over the period 2013-2016, and serves as a baseline for future experiments on these collections. Results presented here are on the test set alone. The reader may find more information about the methods in the MediaEval proceedings, which are listed on the yearly Retrieving Diverse Social Images task pages on the MediaEval website.


Figure 2. Evolution of the diversification performance for the different datasets in terms of precision (P) and cluster recall (CR) at different cut-off values. Boxplots show the interquartile range (IQR), i.e. where 50% of the values lie; the line within the box is the median; the whiskers extend to 1.5*IQR; points outside (+) are outliers. The Flickr baseline represents the initial Flickr retrieval result for the corresponding dataset.


The Retrieving Diverse Social Images task datasets, as their name indicates, address the problem of retrieving images while taking into account both the need to diversify the results presented to the user and the potential lack of credibility of users in their tagging behavior. They are built on state-of-the-art retrieval technology (i.e., the Flickr retrieval system), which makes it possible to focus on the challenge of image diversification. Moreover, the datasets are not limited to images, but also include rich social information. The credibility component, represented by the credibility subsets of the last three collections, is unique to this set of benchmark datasets.


The Retrieving Diverse Social Images task datasets were made possible by the effort of a large team of people over an extended period of time. The contributions of the authors were essential. Further, we would like to acknowledge the multiple team members who contributed to annotating the images and making the MediaEval task possible. Please see the yearly Retrieving Diverse Social Images task pages on the MediaEval website.


Should you have any inquiries or questions about the datasets, don’t hesitate to contact us via email at: bionescu at imag dot pub dot ro.


[1] (last visited 2017-11-29).






[7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding” in ACM International Conference on Multimedia, 2014, pp. 675–678.

[8] E. Spyromitros-Xioufis, S. Papadopoulos, A. L. Ginsca, A. Popescu, Y. Kompatsiaris, and I. Vlahavas, “Improving diversity in image search via supervised relevance scoring” in ACM International Conference on Multimedia Retrieval, 2015, pp. 323–330.

[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets” arXiv preprint arXiv:1405.3531, 2014.


Multidisciplinary Column: Inclusion at conferences, my ISMIR experiences

In 2009, I attended my very first international conference. At that time, I had recently graduated with my Master’s degree in Computer Science, and was just starting the road towards a PhD; in parallel, I had also started pursuing my Master’s degree in Piano Performance at the conservatoire. As a computer scientist, I had conducted my MSc thesis project on cover song retrieval, which had resulted in an accepted paper at ISMIR, the yearly conference of the International Society for Music Information Retrieval.

That something like ‘Music Information Retrieval’ (Music-IR) existed, in which people performed computer science research in the music domain, fascinated me deeply. While I was training to become both a musician and a computer scientist, up to that point, I mostly had been encouraged to keep these two worlds as segregated as possible. As a music student, I would be expected to be completely and exclusively committed to my instrument; I often felt like a cheater when I was working on my computer science assignments. As a computer scientist, many of my music interests would be considered to be on the ‘artistic’, ‘subjective’ or even ‘fluffy’ side; totally fine if that was something I wanted to spend my hobby time on, but seriously integrating this with cold, hard computer science techniques seemed quite unthinkable.

Rather than going to a dedicated Music-IR group, I had remained at Delft University of Technology for my education, seeing parallels between the type of Multimedia Computing research done in the group of Alan Hanjalic and the problems I wanted to tackle in the music domain. However, that did mean I was the only one working on music there, and thus that I was going to travel on my own to this conference…to Kobe, Japan, literally on the other end of the globe.

On the first day, I felt as impressed as I felt intimidated and lonely. All those people whose work I had read for years now became actual human beings I could talk to. Yet, I would not quite dare to walk up to them myself…surely, they would have more interesting topics to discuss with more interesting people than me!

However, I was lucky enough to get ‘adopted’ by Frans Wiering from Utrecht University, a well-known senior member of the community, who knew me from The Netherlands, as I had attended a seminar surrounding the thesis defense of one of his PhD students in the past. Before I got the chance to silently vanish into a corner of the reception room, he started proactively introducing me to the many people he was talking to himself. In the following days, I naturally started talking to these people as a consequence, and became increasingly confident in initiating new contacts myself.

With ISMIR being a single-track conference, I got the chance to soak up a very diverse body of work, presented by a very diverse body of researchers, with backgrounds ranging from machine learning to musicology. At one point, there was a poster session in which I discussed a signal processing algorithm with one of the presenters, turned around, literally remaining at the same physical location, and then discussed historical music performance practice with the opposite presenter. At this venue, the two parts of my identity which I so far had largely kept apart, turned out to actually work out very well together.

I attended many ISMIRs since, and time and time again, I kept seeing confirmations that a diversity of backgrounds, within attendees and between attendees, was what made the conference strong and inspiring. Whether we identify as researchers in signal processing, machine learning, library sciences, musicology, or psychology, what connects us all is that we look at music (and personally care about music), which we validly can do in parallel, each from our respective dedicated specialisms.

We do not always speak the same professional language, and we may validate in different ways. It requires effort to understand one another, more so than if we would only speak to people within our own niche specializations. But there is a clear willingness to build those bridges, and learn from one another. As one example, this year at ISMIR 2017, I was invited on a panel on the Future of Music-IR research, and each of the panelists was asked what works or research directions outside of the Music-IR community we would recommend for the community to familiarize with. I strongly believe that discussions like this, aiming to expand our horizons, are what we need at conferences…and what truly legitimizes us traveling internationally to exchange academic thoughts with our peers in person.

I also have always found the community extremely supportive in terms of reviewing. Even in case of rejections, one would usually receive a constructive review back, with multiple concrete pointers for improvements. Thanks to proactive TPC member actions and extensive reviewer guidelines with examples, the average review length for papers submitted to the ISMIR conference went up from 390 words in 2016 to 448 words in 2017.

As this was the baseline I was used to, my surprise was great when I first got confronted with the feared ‘two-line review’…which, sadly, turned out to be the more common type of review in research at large. We recently have been discussing this within the SIGMM community, and in those discussions, more extensive reviewer guidelines seemed to be considered a case of ‘TL;DR’ (‘reviewers are busy enough, they won’t have time to read that’). But this is a matter of how we want our academic culture to be. Of course, a thorough and constructive review requires more time commitment than a two-line review, and this may become a problem in situations of high reviewer load. But rather than silently trying to hack the problem as individual reviewers (with more mediocre attention as a likely consequence), maybe we should be more consciously selective about what we can handle, and openly discuss it with the community in case we run into capacity issues.

Returning to the ISMIR community: at a more institutional level, inclusion has become a main focus point. In terms of gender inclusion, a strong Women in MIR (WiMIR) group has emerged in the past years, enabling an active mentoring program and arranging travel grant sponsorship to support conference attendance by female researchers. But the impact reaches beyond gender inclusion. WiMIR also introduced a human bingo at its receptions, for which conference attendees with various characteristics (e.g. ‘has two degrees’, ‘attended the conference more than five times’, ‘is based in Asia’) need to be identified. This is a simple and effective ice-breaker, which has attendees actively seeking out people they have not yet spoken with. That the responsibility for getting included at events should not fall solely on new members, but should actively be championed by the existing ‘insiders’, was also recently emphasized in a great post by Eric Holscher.

So, is ISMIR the perfect academic utopia? No, of course we have our issues. As a medium-sized community, we foster cross-domain interaction well, but for individual specializations to gain sufficient momentum, an explicit outlook beyond our own platform is needed. We also have some status issues. Our conference, being run by an independent society, is frequently omitted from conference rankings; however, this independence is deliberate, as it better fosters the venue’s accessibility to other disciplines. And with an average acceptance rate of around 40%, we are often deemed ‘not sufficiently selective’…but in my experience, there usually is a narrow band of clear accepts, a narrow band of clear rejects, and a broad grey-zone band in the middle. At more selective conferences, the clear rejects are typically larger in volume, and much worse in quality, than the worst submissions I have ever seen at ISMIR.

In any case, given the ongoing discussions about SIGMM conferences, multidisciplinarity and inclusion, I felt that sharing some thoughts and observations from this neighboring community would be useful.

And…I already really look forward to serving as a general co-chair of ISMIR’s 20th anniversary edition in 2019—which will be exactly 10 years after my first, shy debut in the field.

About the Column

The Multidisciplinary Column is edited by Cynthia C. S. Liem and Jochen Huber. Every other edition, we will feature an interview with a researcher performing multidisciplinary work, or a column of our own hand. For this edition, we feature a column by Cynthia C. S. Liem.

Dr. Cynthia C. S. Liem is an Assistant Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. She initiated and co-coordinated the European research project PHENICX (2013-2016), focusing on technological enrichment of symphonic concert recordings with partners such as the Royal Concertgebouw Orchestra. Her research interests consider music and multimedia search and recommendation, and increasingly shift towards making people discover new interests and content which would not trivially be retrieved. Beyond her academic activities, Cynthia gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships and the Google European Doctoral Fellowship 2010 in Multimedia, and a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach.

Dr. Jochen Huber is a Senior User Experience Researcher at Synaptics. Previously, he was an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as a program committee member of premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015, and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and the ACM International Conference on Interactive Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage: