Multidisciplinary Column: An Interview with Emilia Gómez

Could you tell us a bit about your background, and what the road to your current position was?

I have a technical background in engineering (telecommunication engineer specialized in signal processing, PhD in Computer Science), but I have also followed formal musical studies at the conservatory since I was a child. So I think I have an interdisciplinary background.

Could you tell us a bit more about how you have encountered multidisciplinarity and interdisciplinarity both in your work on music information retrieval and your current project on human behavior and machine intelligence?

Music Information Retrieval (MIR) is itself a multidisciplinary research area intended to help humans make better sense of music data. MIR draws from a diverse set of disciplines, including, but by no means limited to, music theory, computer science, psychology, neuroscience, library science, electrical engineering, and machine learning.

In my current project HUMAINT at the Joint Research Centre of the European Commission, we try to understand the impact that algorithms will have on humans, including our decision making and cognitive capabilities. This challenging topic can only be addressed in a holistic way and by incorporating insights from different disciplines. At our kick-off workshop, we gathered researchers working in distant fields, from computer science to philosophy, including law, neuroscience and psychology, and we realised the need to engage in scientific discussions from different views and perspectives to address human challenges in a holistic way.

What have, in your personal experience, been the main advantages of multidisciplinarity and interdisciplinarity? Have you also encountered any disadvantages or obstacles?

The main advantage I see is the fact that we can combine distinct methodologies to generate new insights. For researchers, stepping out of a discipline’s comfort zone makes us more creative and innovative.

One disadvantage is the fact that when you work on a multidisciplinary field you seem not to fit into traditional academic standards. In my case, I am perceived as a musician by engineers and as an engineer by musicians.

Beyond the academic community, your work also closely connects to interests by diverse types of stakeholders (e.g. industry, policy-makers). In your opinion, what are the most challenging aspects for an academic to operate in such a diverse stakeholder environment?

The most challenging part of diverse teams is communication, e.g. being able to speak the same language (we might need to create interdisciplinary glossaries!) and to explain our research in an accessible way so that it is understood by people with diverse backgrounds and areas of expertise.

Regarding your work on music, you often have been speaking about making all music accessible to everyone. What do you consider the grand research challenges regarding this mission?

Many MIR researchers desire that technology can be used to make all music accessible to everyone, i.e. that our algorithms can help people discover new music, develop a varied musical taste and become open to new music and, at the same time, to new ideas and cultures. We often talk of our desire that MIR algorithms help people discover music in the so-called ‘long tail’, i.e. music that is not so popular or present in the mainstream. I believe the variety of music styles reflects the variety of human beings, e.g. in terms of culture, personalities and ideas. Through music we can then enrich our culture and understanding.

As the newly elected president of the ISMIR society, are there any specific missions regarding the community you would like to emphasize?

I have had the chance to work with an amazing ISMIR board over the last years, an incredible group of people willing to contribute to our community with their talent and time. With this team it is very easy to work!

This year, ISMIR is organizing its 19th edition (yes, we are getting old)! There are many challenges at ISMIR that we as a community should address, but at the moment I would like to emphasize some relevant aspects that are now somehow a priority for the board.

The first one is to maintain and expand its scientific excellence, as ISMIR should continue to provide key scientific advancements in our field. In this respect, we have recently launched our open access journal, the Transactions of ISMIR, to foster the publication of deeper and more mature research works in our area.

The second one is to promote variety in our community, e.g. in terms of discipline, gender or geographical location, also related to music culture and repertoire. In this respect, and thanks to our members, we have promoted ISMIR taking place at different locations, including editions in Asia (e.g. 2014 in Taipei, Taiwan, and 2017 in Suzhou, China).

Other aspects we value are reproducibility, openness and accessibility. In this sense, our priority is to maintain affordable registration rates, taking advantage of sponsorships from our industrial members, and to devote our membership fees to providing travel funds for students or other members in need so that they can attend ISMIR.

How and in what form do you feel we as academics can be most impactful?

The academic environment gives you a lot of flexibility and freedom to define research roadmaps, although there are always some dependencies on funding. In addition, academia provides time to reflect and go deep into problems that are not directly related to a product in the short term. In the technological field, academia has the potential to advance technologies by focusing on a deeper understanding of why these technologies work well or not, e.g. through theoretical analysis or comprehensive evaluation.

You also have been very engaged in missions surrounding Women in STEM, for example through the Women in MIR initiatives. In discussions on fostering diversity, the importance of role models is frequently mentioned. How can we be good role models?

Yes, I have become more and more concerned about the lack of opportunities that women have in our field with respect to their male colleagues. In this sense, Women in MIR is playing a major role in promoting the role and opportunities of women in our field, including a mentoring program, funding for women to attend ISMIR, and the creation of a public repository of female researchers to make them more visible and present.

I think women are already great role models in their different profiles, but they lack visibility with respect to their male colleagues.


Bios

Dr. Emilia Gómez graduated as a Telecommunication Engineer at Universidad de Sevilla and studied piano performance at the Seville Conservatoire of Music, Spain. She then received a DEA in Acoustics, Signal Processing and Computer Science applied to Music at IRCAM, Paris, and a PhD in Computer Science at Universitat Pompeu Fabra in Barcelona (2006). She has been a visiting researcher at the Royal Institute of Technology, Stockholm (Marie Curie Fellow, 2003), McGill University, Montreal (AGAUR competitive fellowship, 2010), and Queen Mary University of London (José de Castillejos competitive fellowship, 2015). After her PhD, she was first a lecturer in Sonology at the Higher School of Music of Catalonia and then joined the Music Technology Group, Department of Information and Communication Technologies, Universitat Pompeu Fabra in Barcelona, Spain, first as an assistant professor and then as an associate professor (2011) and ICREA Academia fellow (2015). In 2017, she became the first female president of the International Society for Music Information Retrieval, and in January 2018, she joined the Joint Research Centre of the European Commission as Lead Scientist of the HUMAINT project, studying the impact of machine intelligence on human behavior.

Editor Biographies

Dr. Cynthia C. S. Liem is an Assistant Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. She initiated and co-coordinated the European research project PHENICX (2013-2016), focusing on technological enrichment of symphonic concert recordings with partners such as the Royal Concertgebouw Orchestra. Her research interests consider music and multimedia search and recommendation, and increasingly shift towards making people discover new interests and content which would not trivially be retrieved. Beyond her academic activities, Cynthia gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, and a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach.

 

 

Dr. Jochen Huber is a Senior User Experience Researcher at Synaptics. Previously, he was an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and ACM International Conference on Interface Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage: http://jochenhuber.com

An interview with Miriam Redi

Miriam nowadays.

Miriam at the beginning of her research career.

Describe your journey into computing from your youth up to the present. What foundational lessons did you learn from this journey? Why were you initially attracted to multimedia?

I literally grew up with computers all around me. I was born in a little town raised around the headquarters of Olivetti, one of the biggest tech companies of the last century: becoming a computer geek, in that place, at that time, was easier than usual! I have always been fascinated by the power of visuals and music to convey ideas. I loved to learn about history and the world through songs and movies. How could I merge my love for computers with my passion for the audiovisual arts? I enrolled in Media Engineering studies, where, aside from the traditional Computer Engineering knowledge, I had the chance to learn more about media history and design. The main message? Multidisciplinarity is key. We cannot design intelligent multimedia technologies without deeply understanding how a medium is created, perceived and distributed.

Talking about multidisciplinarity, what do you think is the current state of multidisciplinarity in the multimedia community?

My impression is that, due to the inherent multimodality of our research, our community has developed a natural ability to blend techniques and theories from various domains. I believe we can push the boundaries of this multidisciplinarity even further. I am thinking, for example, of the MM subcommunity interested in mining subjective attributes from data, such as mood, sentiment, or beauty. I believe such research could benefit greatly from collaboration between MM scientists and domain experts in psychology, cognitive science, visual perception, or visual arts.

Tell us more about the vision and objectives behind your current roles. What do you hope to accomplish, and how will you bring this about?

My dream is to make multimedia science even more useful for society and for collective growth. Multimedia data allows us to easily absorb and communicate knowledge, without language barriers. Producing and generating audiovisual content has never been easier: today, the potential of multimedia for learning and sharing human knowledge is unprecedented! Intelligent multimedia systems could be put in place to support editor communities in making free online encyclopedias like Wikipedia or collaborative knowledge bases like Wikidata more “visual” – and therefore less tied to individual languages. By doing so, we could increase the possibility for people around the world to freely access the sum of all knowledge.

I like your approach about making something useful for society. What do you think about the criticism that multimedia research is too applied?

For me, high-quality research means creative research, where ‘creative’ means ‘new and valuable’. The coexistence of breadth and depth in Multimedia allows us to create novel and useful applied research works, making these, to me, as interesting and inspiring as more theoretical research works.

Can you profile your current research, its challenges, opportunities, and implications?

I work on responsible multimedia algorithms. I love building machines that can classify audiovisual and textual data according to subjective properties – for example, the informativeness of an image with respect to a topic, its epistemic value, the beauty of a photo, the creative degree of a video. Given the inherently subjective nature of these algorithms, one of the main challenges of my research is to make such models responsible, namely:
1) Diversity-Aware, i.e. reflecting the real subjective perception of people with different cultural backgrounds; this is key to empowering specific cultures, designing AI to grow diversified content and fill the knowledge gaps in online knowledge repositories.
2) Interpretable and Unbiased, namely not only able to classify content, but also able to say why the content was classified in a certain way (so that we can detect algorithmic bias). Such powerful algorithms can be used to study the visual preferences of users of web and social media platforms, and retrieve interesting content accordingly.

Do you think that one day we will have algorithms that truly understand human perception of beauty and art? Or will it always be dependent on the data?

Philosophers have been trying for centuries to understand the true nature of aesthetic perception. In general, I do not believe in absolute truths. And I am not really confident that algorithms will be able to become great philosophers anytime soon.

How would you describe the role of women especially in the field of multimedia?

The role of women in multimedia is the role of any researcher in their scientific community: contribute to scientific development, push the boundaries of what is known, doubt the widely accepted notions, make this world a better place (no pressure!). Maintaining diversity (any kind of diversity – including gender, expertise, race, age) in the scientific discourse is crucial: as opposed to a single mono-culture, a diverse community gathers, elaborates and combines different perspectives, thus forcing a collective creative process of exchange and growth, which is essential to scientific development.

Do you think that female researchers are well represented in the multimedia community? For example, there was no female keynote speaker at ACM MM 2017.

I am not sure about the numbers, so I can’t say for sure what the percentage of women and non-binary gender persons in the multimedia community is. But I am sure that percentage is greater than 0. When filling positions of high visibility such as keynotes or committee members, we should always keep in mind that one of our tasks is to inspire younger generations. Generations of young, brilliant, beautifully diverse researchers.

How would you describe your top innovative achievements in terms of the problems you were trying to solve, your solutions, and the impact it has today and in the future?

Since my early days in multimedia, when we were retrieving video shots of airplanes, until today, when we classify creative videos or interesting pictures, I would say that the main contribution of my research has been to “break the boundaries”.
We broke the scientific field boundaries. We designed multimedia algorithms inspired by the visual arts and psychology; we collaborated with experts from philosophy, media history, sociology; and we could deliver creative, interdisciplinary research works which would contribute to the advancement of multimedia and all the fields involved.

We broke the social network boundaries, with models able to quantify the intrinsic quality of images in a photo sharing platform. Furthermore, we showed that popularity-driven mechanisms, typical of social networks, fail to promote high-quality content, and that only content-based quality assessment tools could restore meritocracy in online media platforms.

We broke the cultural boundaries: together with an amazing multi-cultural research team, we were able to design computer vision models that can adapt to different cultures and language communities. While the effectiveness of our approaches and the scientific growth is per se a main achievement, the publications resulting from this collaborative effort reached the top-level Computer Vision, Multimedia and Social Media conferences (with a best paper award at ICWSM and a best multimodal paper award at ICMR), and our work was featured by a number of tech journals and in a TEDx presentation. Together with other scientists, we also started a number of initiatives to gather people from different communities who are interested in this area: a special session at ICMR 2017, a workshop at MM 2017, one at CVPR 2018, and a special issue of ACM TOMM.

What are in your opinion the future topics in multimedia? Where is the community strong, and where could it improve or increase focus?

My feeling is that we should re-discover and empower the ‘multi-’ness of our research field.
I think the beauty of multimedia research is the ability to tell compelling multimodal stories from signals of very diverse nature, with a focus on the positive experience of the user. We are able to process multiple sources of information and use them, for example, to generate multi-sensorial artistic compositions, expose interesting findings about users and their behavior in multiple modalities, or provide tools to explore and align multimodal information, allowing easier knowledge absorption. We should not forget the diversity of modalities we are able to process (e.g. music or social signals, or traditional image data), the types of attributes we can draw from these modalities (e.g. sentiment or appeal, or more binary semantic labels), and the variety of application scenarios we can imagine for our research works (e.g. arts, photography, cooking, or more consolidated use cases, such as image search or retrieval). And we should encourage emerging topics and applications towards these ‘multi-’nesses.
Beyond multidisciplinarity and multiple modalities, I would also hope to see more multi-cultural research works: given the beautifully diverse world we are part of, I believe multimedia research works and applications should model and take into account the multiple points of views, diverse perceptual responses, as well as the cultural and language differences of users around the world.


Over your distinguished career, what are your top lessons you want to share with the audience?

I am not sure if this is a real lesson; it is more something I deeply believe in. Stereotypes kill ideas. Stereotyping others (colleagues, friends) might make communication, brainstorming, or collective problem solving much harder, because it somehow influences the importance given to other people’s ideas. Also, stereotyping oneself and one’s limits might constrain the possibilities and narrow one’s view on the shapes of possible future paths.

How was it to have a sister working in the same field of research? Is it motivation or pressure? Did you collaborate on some topics?

In one word: inspiring. We never officially collaborated in any research work. Unofficially, we’ve been ‘collaborating’ for 32 years :) (Interview with Judith Redi)

ACM Fellows in the SIGMM Community

Multimedia can be defined as the seamless integration of digital technologies in ways which provide for an enriched experience for users as we create and consume information with high fidelity.  Behind that definition lies a host of enabling digital technologies to allow us create, capture, store, analyse, index, locate, transmit and present information. But when did multimedia, as we now know it, start?  Was it the ideas of Vannevar Bush and Memex, or Ted Nelson and Xanadu or the development of Apple computers in the mid 1980s or maybe the emergence of the web which enables distribution of multimedia?

Certainly by the early 1990s, and definitely by 1993 when SIGMM was founded, multimedia was established and ACM SIGMM was recognised as a mainstream activity within computing. Over the intervening two and a half decades we’ve seen tremendous progress, incredible developments and a wholesale adoption of our technologies right across our society. All this has been achieved partly on the backs of innovations by many eminent scientists and technologists who are leaders within our SIGMM community.

We recently saw two of our SIGMM community elevated to the grade of ACM Fellow, joining the 52 other new ACM Fellows in the class of 2017. Our congratulations go to Yong Rui and to Shih-Fu Chang for their elevation to that grade. Yong had a lovely interview for SIGMM on the significance of this honour for him as a researcher, and for us all in SIGMM, which is available at http://sigmm.org/news/interview-dr-yong-rui-acm-fellow, and it’s worth reflecting on some of our other SIGMM family who have been similarly honoured in the past.

While checking SIGMM membership is an easy thing to do (though it’s a bit more difficult to check back throughout our membership history), it is a bit arbitrary to define who is and who is not part of our SIGMM “family”. To me it’s somebody who is, or has been, an active participant or organiser of our events, or a contributor to our field. Our SIGMM family includes those I would associate with SIGMM rather than any other SIG, and with ACM rather than with any other society.

In the class of new ACM Fellows for 2017 Shih-Fu Chang is elevated “for contributions to large-scale multimedia content recognition and multimedia information retrieval”. Shih-Fu is my predecessor as SIGMM chair and still serves on the SIGMM Executive as well as maintaining a hugely impressive research output.  He won the SIGMM Outstanding Technical Achievement Award in 2011. 

Yong Rui was also elevated to ACM Fellow in 2017 “for contributions to image, video and multimedia analysis, understanding and retrieval”.  Yong is a long-time supporter of SIGMM Conferences as well as a regular attendee and major contributor to our field.

Wen Gao of Peking University is vice president of the National Natural Science Foundation of China and was a co-chair of ACM Multimedia in 2009. He is also on the advisory board of ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) and was elevated to ACM Fellow in 2013 “for contributions to video technology, and for leadership to advance computing in China”.

Zhengyou Zhang was also elevated in 2013 “for contributions to computer vision and multimedia” and continues to serve the SIGMM community, most recently as Best Papers chair at MM 2017.

Klara Nahrstedt (class of 2012) was elevated “for contributions to quality-of-service management for distributed multimedia systems” and has served as SIGMM Chair prior to Shih-Fu. In 2014 Klara won the SIGMM Technical Achievement Award and until last year she also served on the SIGMM Executive Committee.

Joe Konstan (class of 2008) was elevated “for contributions to human-computer interaction” and he also won the ACM Software System Award in 2010. Joe was the ACM MM 2000 TPC Chair and was on the SIGMM Executive Committee from 1999 to 2007.

HongJiang Zhang (class of 2007) was elevated to Fellow “for contributions to content-based analysis and retrieval of multimedia”.  HongJiang also won the 2012 SIGMM Outstanding Technical Achievement Award and he has a huge publications output with a Google Scholar h-index of 120.

Ramesh Jain (class of 2003) was elevated “for contributions to computer vision and multimedia information systems”. Ramesh remains one of the most prolific authors in our field and a regular, almost omnipresent, attendee at our major SIGMM conferences. In 2010 Ramesh won the SIGMM Outstanding Technical Achievement Award.

Ralf Steinmetz (class of 2001) was elevated for “pioneering work in multimedia communications and education, including fundamental contributions in perceivable Quality of Service for multimedia systems derived from multimedia synchronization, and for multimedia education”. Ralf is also the winner of the inaugural ACM SIGMM Technical Achievement Award, presented in 2008, and between 2009 and 2015 he served as Editor-in-Chief of the ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), formerly known as TOMCCAP.

Larry Rowe (class of 1998) was elevated “for seminal contributions to programming languages, relational database technology, user interfaces and multimedia systems”.  Larry is a past chair of SIGMM (1998-2003) and in 2009 he received the SIGMM Technical Achievement Award.
 
P. Venkat Rangan was elevated to ACM Fellow in 1998. At the recent ACM MM Conference in 2017, we had a short presentation on the first ACM MM Conference in 1993 and Venkat’s efforts in organising that first MM was acknowledged in that presentation. Venkat’s ACM Fellowship citation says that he “founded one of the foremost centers for research in multimedia, in which area he is an inventor of fundamental techniques with global impact”.

What is interesting to note about these awardees is the broad range of areas in which their contributions are grounded, covering the “traditional” areas of multimedia. These range from quality of service delivery across networks to analysis of content and from user interfaces and interaction to progress in fundamental computer vision.  This reflects the broad range of research areas covered by the SIGMM community, which has been part of our DNA since SIGMM was founded.

Our ACM Fellows are a varied and talented group of individuals, each richly deserving of their award, and their only unifying theme is broad multimedia; that’s one of our distinguishing features. In some SIGs, like SIGARCH (computer architecture), SIGCSE (computer science education) or SIGIR (information retrieval), there is a focus on a narrow topic or challenge, while in other SIGs, like SIGCHI (computer human interaction), SIGMOD (management of data) or SIGAI (artificial intelligence), there is a broad range of research areas. SIGMM sits with those areas where our application and impact is broad.

The ACM Fellow awards started nearly 25 years ago. Further details can be found at https://awards.acm.org/award-nominations and a link to each of the awards can be found at https://awards.acm.org/fellows/award-winners

Diversity and Credibility for Social Images and Image Retrieval

Social media has established itself as an inextricable component of today’s society. Images make up a large proportion of the items shared on social media [1]. The popularity of social image sharing has contributed to the popularity of the Retrieving Diverse Social Images task at the MediaEval Benchmarking Initiative for Multimedia Evaluation [2]. Since its introduction in 2013, the task has attracted large participation and has published a set of datasets of outstanding value to the multimedia research community.

The task, and the datasets it has released, target a novel facet of multimedia retrieval, namely the search result diversification of social images. The task is defined as follows: Given a large number of images, retrieved by a social media image search engine, find those that are not only relevant to the query, but also provide a diverse view of the topic/topics behind the query (see an example in Figure 1). The features and methods needed to address the task successfully are complex and span different research areas (image processing, text processing, machine learning). For this reason, when creating the collections used in the Retrieving Diverse Social Images Tasks, we also created a set of baseline features. The features are released with the datasets. In this way, task participants who have expertise in one particular research area may focus on that area and still participate in the full evaluation.
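To make the diversification goal concrete, below is a minimal, illustrative sketch (not one of the benchmark submissions) of an MMR-style greedy re-ranking that trades off relevance against redundancy using any per-image descriptors, for instance the visual features released with the datasets. All function names, parameters and the toy data are hypothetical.

```python
import numpy as np

def greedy_diversify(features, relevance, k=20, lambda_rel=0.7):
    """Re-rank images with a simple MMR-style greedy trade-off between
    relevance and diversity. `features` is an (n, d) array of per-image
    descriptors, `relevance` an (n,) array of relevance scores from the
    initial ranking (e.g. the Flickr result list)."""
    # Cosine similarity between all pairs of images.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = normed @ normed.T

    selected, remaining = [], list(range(len(relevance)))
    while remaining and len(selected) < k:
        def mmr(i):
            # Penalise images that are too similar to ones already picked.
            redundancy = max(sim[i, j] for j in selected) if selected else 0.0
            return lambda_rel * relevance[i] - (1 - lambda_rel) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy usage with random descriptors standing in for the real features.
rng = np.random.default_rng(0)
order = greedy_diversify(rng.normal(size=(150, 64)), rng.random(150), k=20)
```

The benchmark submissions combine visual, textual and credibility information in far more elaborate ways; the sketch only illustrates the relevance/diversity trade-off at the core of the task.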

Figure 1: Example of retrieval and diversification results for query “Pingxi Sky Lantern Festival” (results are truncated to the first 14 images for better visualization): (top images) Flickr initial retrieval results; (bottom images) diversification achieved with the approach from the TUW team (best approach at MediaEval 2015).


The collections

Before describing the individual collections, it should be noted that all data consist of redistributable Creative Commons Flickr and Wikipedia content and are freely available for download (follow the instructions here [3]). Although the task also ran in 2017, we focus in the following on the datasets already released, namely: Div400, Div150Cred, Div150Multi and Div150Adhoc (corresponding to the 2013-2016 evaluation campaigns). Each of the four datasets available so far covers different aspects of the diversification challenge, either from the perspective of the task/use case addressed, or from the data that can be used to address the task. Table 1 gives an overview of the four datasets, which we describe in more detail over the next four subsections. Each of the datasets is divided into a development set and a test set. Although the division into development and test data is arbitrary, for comparability of results and full reproducibility, users of the collections are advised to maintain the separation when performing their experiments.

Table 1: Dataset statistics (devset – development data, testset – testing data, credibilityset – data for estimating user tagging credibility, single (s) – single-topic queries, multi (m) – multi-topic queries, ++ – enhanced/updated content, POI – location point of interest, events – events and states associated with locations, general – general-purpose ad-hoc topics).

Div400

In 2013, the task started with a narrowly defined use-case scenario, where a tourist, upon deciding to visit a particular location, reads the corresponding Wikipedia page and desires to see a diverse set of images from that location. Queries here might be “Big Ben in London” or “Palazzo delle Albere in Italy”. For each such query, we know the GPS coordinates, the name, and the Wikipedia page, including an example image of the destination. As a search pool, we consider the top 150 photos obtained from Flickr using the name as a search query. These photos come with some metadata (photo ID, title, description, tags, geotagging information, date when the photo was taken, owner’s name, number of times the photo has been displayed, URL in Flickr, license type, number of comments on the photo) [4].

In addition to providing the raw data, the collection also contains visual and text features of the data, such that researchers who are only interested in one of the two, can use the other without investing additional time in generating a baseline set of features.

As visual descriptors, for each of the images in the collection, we provide:

  • Global color naming histogram
  • Global histogram of oriented gradients
  • Global color moments on HSV
  • Global Locally Binary Patterns on gray scale
  • Global Color Structure Descriptor
  • Global statistics on gray level Run Length Matrix (Short Run Emphasis, Long Run Emphasis, Gray-Level Non-uniformity, Run Length Non-uniformity, Run Percentage, Low Gray-Level Run Emphasis, High Gray-Level Run Emphasis, Short Run Low Gray-Level Emphasis, Short Run High Gray-Level Emphasis, Long Run Low Gray-Level Emphasis, Long Run High Gray-Level Emphasis)
  • Local spatial pyramid representations (3×3) of each of the previous descriptors

As textual descriptors we provide the classic Term Frequency (TF(t, d) – the number of occurrences of term t in document d) and Document Frequency (DF(t) – the number of documents containing term t). Note that the datasets are not limited to a single notion of document. The most direct definition of a “document” is an image that can be either retrieved or not retrieved. However, it is easily conceivable that the relative frequency of a term in the set of images corresponding to one topic, or the set of images corresponding to one user, might also be of interest in ranking the importance of a result to a query. Therefore, the collection also contains statistics that take a document to be a topic, as well as a user. All of these are provided both as CSV files and as Lucene index files. The former can be used as part of a custom weighting scheme, while the latter can be deployed directly in a Lucene/Solr search engine to obtain results based on the text without further effort.
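As an illustration of how the released TF and DF statistics might be combined in a custom weighting scheme, the sketch below computes a classic TF-IDF relevance score. The CSV column layout, file paths and function names are assumptions for illustration and should be adapted to the actual files.

```python
import csv
import math
from collections import defaultdict

def load_tf(path):
    """Read a CSV of (term, document, tf) rows into nested dicts.
    The column layout is assumed for illustration; adapt to the real files."""
    tf = defaultdict(dict)
    with open(path, newline="", encoding="utf-8") as f:
        for term, doc, count in csv.reader(f):
            tf[doc][term] = int(count)
    return tf

def tfidf_score(query_terms, doc_terms, df, n_docs):
    """Classic TF-IDF relevance of one document (here: one image's text
    metadata) for a query, using TF and DF statistics."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms and df.get(t, 0) > 0:
            score += doc_terms[t] * math.log(n_docs / df[t])
    return score
```

The same scoring can of course be obtained without custom code by loading the released Lucene index into a Lucene/Solr instance, as noted above.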

Div150Cred

The tourism use case also underlies Div150Cred, but a component addressing the concept of user tagging credibility is added. The idea here is that not all users tag their photos in a manner that is useful for retrieval, and, for this reason, it makes sense to consider, in addition to the visual and text descriptors also used in Div400, another feature set – user credibility features. Each of the 153 topics (30 in the development set and 123 in the test set) therefore comes, in addition to the visual and text features of each image, with a value indicating the credibility of the user. This value is estimated automatically based on a set of features, so in addition to the retrieval development and test sets, Div150Cred also contains a credibility set, used by us to generate the credibility of each user, and which can be used by any interested researcher to generate better credibility estimators.

The credibility set contains images for approximately 300 locations from 685 users (a total of 3.6 million images). For each user there is a manually assigned credibility score as well as an automatically estimated one, based on the following features:

  • Visual score – learned predictor of a user’s consistent and relevant tagging behavior
  • Face proportion
  • Tag specificity
  • Location similarity
  • Photo count
  • Unique tags
  • Upload frequency
  • Bulk proportion

For each of these, the intuition behind it and the actual calculation is detailed in the collection report [5].
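The actual credibility estimators are described in the collection report [5]. Purely as an illustration of how the per-user features listed above could be turned into a credibility score, here is a sketch that fits a simple ridge regression against the manually assigned scores; random placeholder data stands in for the real feature files, and the model choice is an assumption, not the method used for the dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical feature matrix: one row per user, columns in the order of the
# credibility features listed above (visual score, face proportion, ...).
rng = np.random.default_rng(42)
X = rng.random((685, 8))            # 685 users, 8 credibility features
y_manual = rng.random(685)          # manually assigned credibility scores

# A plain ridge regression as a baseline credibility estimator; the actual
# estimators used for the dataset are documented in the collection report.
model = Ridge(alpha=1.0)
print(cross_val_score(model, X, y_manual, cv=5, scoring="r2").mean())
```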

Div150Multi

Div150Multi adds another twist to the task of the search engine and its tourism use case. Now, the topics are not simply points of interest, but rather a combination of a main concept and a qualifier, namely multi-topic queries about location-specific events, location aspects or general activities (e.g., “Oktoberfest in Munich”, “Bucharest in winter”). In terms of features, however, the collection builds on the existing ones used in Div400 and Div150Cred, but adds to the pool of resources the researchers have at their disposal. In terms of credibility, in addition to the 8 features listed above, we now also have:

  • Mean Photo Views
  • Mean Title Word Counts
  • Mean Tags per Photo
  • Mean Image Tag Clarity

Again, for details on the intuition and formulas behind these, the collection report [6] is the reference material.

A new set of descriptors has now also been made available, based on convolutional neural networks (an illustrative extraction sketch follows the list below).

  • CNN generic: a descriptor based on the reference convolutional neural network (CNN) model provided along with the Caffe framework [7]. This model is trained on the 1,000 ImageNet classes used during the ImageNet challenge. The descriptors are extracted from the last fully connected layer of the network (named fc7).
  • CNN adapted: These features were also computed using the Caffe framework, with the reference model architecture but using images of 1,000 landmarks instead of ImageNet classes. We collected approximately 1,200 Web images for each landmark and fed them directly to Caffe for training [8]. Similar to CNN generic, the descriptors were extracted from the last fully connected layer of the network (i.e., fc7).
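The released descriptors were computed with Caffe as described above. As an illustration of the general recipe (truncating a pretrained CNN at its second fully connected layer), the sketch below uses torchvision's pretrained AlexNet as a stand-in; it is not the original Caffe reference model, so the resulting vectors are analogous to, but not identical with, the distributed fc7 features. It assumes torchvision 0.13 or later.

```python
import torch
from torchvision import models

# Stand-in for the Caffe reference model: torchvision's pretrained AlexNet.
weights = models.AlexNet_Weights.IMAGENET1K_V1
model = models.alexnet(weights=weights).eval()
preprocess = weights.transforms()

# Keep the classifier only up to the second fully connected layer (fc7).
fc7_head = torch.nn.Sequential(*list(model.classifier.children())[:6])

def fc7_descriptor(pil_image):
    """Return a 4096-d fc7-style descriptor for a PIL image."""
    with torch.no_grad():
        x = preprocess(pil_image).unsqueeze(0)
        feats = torch.flatten(model.avgpool(model.features(x)), 1)
        return fc7_head(feats).squeeze(0).numpy()
```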

Div150AdHoc

For this dataset, the definition of relevance was expanded from previous years, with the introduction of even more challenging multi-topic queries unrelated to POIs. These queries address the diversification problem for a general ad-hoc image retrieval system, where general-purpose multi-topic queries are used for retrieving the images (e.g., “animals at Zoo”, “flying planes on blue sky”, “hotel corridor”). The Div150Adhoc collection includes most of the previously described credibility descriptors, but drops face proportion and location similarity, as they were no longer relevant for the new retrieval scenario. Also, the visual score descriptor was updated in order to keep up with the latest advancements in CNN descriptors. Consequently, when training the individual visual models, the OverFeat visual descriptor is replaced by the representation produced by the last fully connected layer of the network [9]. Full details are available in the collection report [10].

Ground-truth and state-of-the-art

Each of the above collections comes with an associated ground-truth, created by human assessors. As the focus is on both relevance and diversity, the ground truth and the metrics used reflect it: Precision at cutoff (primarily P@20) is used for relevance, and Cluster Recall at cutoff (primarily CR@20) is used for diversity.
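As a minimal sketch of how these two metrics can be computed, assuming the ground truth provides a relevance label and a cluster id for each relevant image (as in the released annotations); the function names and toy data are illustrative only:

```python
def precision_at(ranked_ids, relevant_ids, n=20):
    """Fraction of the top-n results that are relevant (P@n)."""
    top = ranked_ids[:n]
    return sum(1 for i in top if i in relevant_ids) / n

def cluster_recall_at(ranked_ids, image_cluster, n_clusters, n=20):
    """Fraction of ground-truth clusters covered by the relevant images in
    the top-n results (CR@n); `image_cluster` maps relevant ids to clusters."""
    covered = {image_cluster[i] for i in ranked_ids[:n] if i in image_cluster}
    return len(covered) / n_clusters

# Toy usage: 5 ground-truth clusters, a top-20 run covering 3 of them.
run = list(range(20))
rel = set(range(0, 20, 2))
clusters = {i: i % 3 for i in rel}
print(precision_at(run, rel), cluster_recall_at(run, clusters, 5))
```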

Figure 2 shows an overview of the results obtained by participants in the evaluation campaigns over the period 2013-2016, and serves as a baseline for future experiments on these collections. Results presented here are on the test set alone. The reader may find more information about the methods in the MediaEval proceedings, which are listed on the Retrieving Diverse Social Images yearly task pages on the MediaEval website (http://multimediaeval.org/).


Figure 2. Evolution of the diversification performance for the different datasets in terms of precision (P) and cluster recall (CR) at different cut-off values (boxplots: the box spans the interquartile range (IQR), i.e. where 50% of the values lie; the line within the box is the median; the whiskers extend to 1.5*IQR; points outside (+) are outliers). The Flickr baseline represents the initial Flickr retrieval result for the corresponding dataset.


Conclusions

The Retrieving Diverse Social Images task datasets, as their name indicates, address the problem of retrieving images while taking into account both the need to diversify the results presented to the user and the potential lack of credibility of the users’ tagging behavior. They are built on top of existing state-of-the-art retrieval technology (i.e., the Flickr retrieval system), which makes it possible to focus on the challenge of image diversification. Moreover, the datasets are not limited to images, but also include rich social information. The credibility component, represented by the credibility subsets of the last three collections, is unique to this set of benchmark datasets.

Acknowledgments

The Retrieving Diverse Social Image task datasets were made possible by the effort of a large team of people over an extended period of time. The contributions of the authors were essential. Further, we would like to acknowledge the multiple team members who have contributed to annotating the images and making the MediaEval Task possible. Please see the yearly Retrieving Diverse Social Images task pages on the MediaEval website (http://multimediaeval.org/).

Contact

Should you have any inquiries or questions about the datasets, don’t hesitate to contact us via email at: bionescu at imag dot pub dot ro.

References

[1] http://contentmarketinginstitute.com/2015/11/visual-content-strategy/ (last visited 2017-11-29).

[2] http://www.multimediaeval.org/

[3] http://www.campus.pub.ro/lab7/bionescu/publications.html#datasets

[4] http://campus.pub.ro/lab7/bionescu/Div400.html

[5] http://campus.pub.ro/lab7/bionescu/Div150Cred.html

[6] http://campus.pub.ro/lab7/bionescu/Div150Multi.html

[7] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding” in ACM International Conference on Multimedia, 2014, pp. 675–678.

[8] E. Spyromitros-Xioufis, S. Papadopoulos, A. L. Ginsca, A. Popescu, Y. Kompatsiaris, and I. Vlahavas, “Improving diversity in image search via supervised relevance scoring” in ACM International Conference on Multimedia Retrieval, 2015, pp. 323–330.

[9] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return of the devil in the details: Delving deep into convolutional nets” arXiv preprint arXiv:1405.3531, 2014.

[10] http://campus.pub.ro/lab7/bionescu/Div150Adhoc.html

Multidisciplinary Column: Inclusion at conferences, my ISMIR experiences

In 2009, I attended my very first international conference. At that time, I had recently graduated with my Master’s degree in Computer Science, and was just starting on the road towards a PhD; in parallel, I had also started pursuing my Master’s degree in Piano Performance at the conservatoire. As a computer scientist, I had conducted my MSc thesis project on cover song retrieval, which had resulted in an accepted paper at ISMIR, the yearly conference of the International Society for Music Information Retrieval.

That something like ‘Music Information Retrieval’ (Music-IR) existed, in which people performed computer science research in the music domain, fascinated me deeply. While I was training to become both a musician and a computer scientist, up to that point, I mostly had been encouraged to keep these two worlds as segregated as possible. As a music student, I would be expected to be completely and exclusively committed to my instrument; I often felt like a cheater when I was working on my computer science assignments. As a computer scientist, many of my music interests would be considered to be on the ‘artistic’, ‘subjective’ or even ‘fluffy’ side; totally fine if that was something I wanted to spend my hobby time on, but seriously integrating this with cold, hard computer science techniques seemed quite unthinkable.

Rather than having gone to a dedicated Music-IR group, I had remained at Delft University of Technology for my education, seeing parallels between the type of Multimedia Computing research done in the group of Alan Hanjalic, and problems I wanted to tackle in the music domain. However, that did mean I was the only one working on music there, and thus, that I was going to travel on my own to this conference…to Kobe, Japan, literally on the other end of the globe.

On the first day, I felt as impressed as I felt intimidated and lonely. All those people whose work I had read for years now became actual human beings I could talk to. Yet, I would not quite dare to walk up to them myself…surely, they would have more interesting topics to discuss with more interesting people than me!

However, I was so lucky to get ‘adopted’ by Frans Wiering from Utrecht University, a well-known senior member of the community, who knew me from The Netherlands, as I had attended a seminar surrounding the thesis defense of one of his PhD students in the past. Before I got the chance to silently vanish into a corner of the reception room, he started proactively introducing me to the many people he was talking to himself. In the next days, I naturally started talking to these people as a consequence, and became increasingly confident in initiating new contacts myself.

With ISMIR being a single-track conference, I got the chance to soak up a very diverse body of work, presented by a very diverse body of researchers, with backgrounds ranging from machine learning to musicology. At one point, there was a poster session in which I discussed a signal processing algorithm with one of the presenters, turned around, literally remaining at the same physical location, and then discussed historical music performance practice with the opposite presenter. At this venue, the two parts of my identity which I so far had largely kept apart, turned out to actually work out very well together.

I attended many ISMIRs since, and time and time again, I kept seeing confirmations that a diversity of backgrounds, within attendees and between attendees, was what made the conference strong and inspiring. Whether we identify as researchers in signal processing, machine learning, library sciences, musicology, or psychology, what connects us all is that we look at music (and personally care about music), which we validly can do in parallel, each from our respective dedicated specialisms.

We do not always speak the same professional language, and we may validate in different ways. It requires effort to understand one another, more so than if we would only speak to people within our own niche specializations. But there is a clear willingness to build those bridges, and learn from one another. As one example, this year at ISMIR 2017, I was invited on a panel on the Future of Music-IR research, and each of the panelists was asked what works or research directions outside of the Music-IR community we would recommend for the community to familiarize with. I strongly believe that discussions like this, aiming to expand our horizons, are what we need at conferences…and what truly legitimizes us traveling internationally to exchange academic thoughts with our peers in person.

I also have always found the community extremely supportive in terms of reviewing. Even in case of rejections, one would usually receive a constructive review back, with multiple concrete pointers for improvements. Thanks to proactive TPC member actions and extensive reviewer guidelines with examples, the average review length for papers submitted to the ISMIR conference went up from 390 words in 2016 to 448 words in 2017.

As this was the baseline I was originally used to, my surprise was great when I first got confronted with the feared ‘two-line review’…which, sadly, turned out to be the more common type of review in research at large. We recently have been discussing this within the SIGMM community, and in those discussions, more extensive reviewer guidelines seemed to be considered a case of ‘TL;DR’ (‘reviewers are busy enough, they won’t have time to read that’). But this is a matter of how we want our academic culture to be. Of course, a thorough and constructive review needs more time commitment than a two-line review, and this may become a problem in situations of high reviewer load. But rather than silently trying to hack the problem as individual reviewers (with more mediocre attention as a likely consequence), maybe we should be more consciously selective of what we can handle, and openly discuss it with the community in case we run into capacity issues.

Back to the ISMIR community: more institutionally, inclusion has become a main focus point now. In terms of gender inclusion, a strong Women in MIR (WiMIR) group has emerged in the past years, enabling an active mentoring program and arranging for travel grant sponsoring to support conference attendance of female researchers. But the impact reaches beyond gender inclusion. WiMIR also introduced a human bingo at its receptions, for which conference attendees with various characteristics (e.g. ‘has two degrees’, ‘attended the conference more than five times’, ‘is based in Asia’) need to be identified. A very nice and effective way to trigger ice-breaking activities, and to have attendees actively seek out people they have not spoken with yet. That the responsibility for getting included at events should not only fall upon new members, but should actively be championed by the existing ‘insiders’, was also recently emphasized in this great post by Eric Holscher.

So, is ISMIR the perfect academic utopia? No, of course we do have our issues. As a medium-sized community, fostering cross-domain interaction goes well, but having individual specializations gain sufficient momentum needs an explicit outlook beyond our own platform. And we also have some status issues. Our conference, being run by an independent society, is frequently omitted from conference rankings; however, this independence is on purpose, as it better fosters accessibility of the venue towards other disciplines. And with an average acceptance rate of around 40%, we are often deemed ‘not sufficiently selective’…but in my experience, there usually is a narrow band of clear accepts, a narrow band of clear rejects, and a broad grey-zone band in the middle. And in more selective conferences, the clear rejects typically have a larger volume, and are much worse in quality, than the worst submissions I have ever seen at ISMIR.

In any case, given the ongoing discussions about SIGMM conferences, multidisciplinarity and inclusion, I felt that sharing some thoughts and observations from this neighboring community would be useful.

And…I really look forward already to serving as a general co-chair of ISMIR’s 20th anniversary in 2019—which will be exactly 10 years after my first, shy debut in the field.


About the Column

The Multidisciplinary Column is edited by Cynthia C. S. Liem and Jochen Huber. Every other edition, we will feature an interview with a researcher performing multidisciplinary work, or a column of our own hand. For this edition, we feature a column by Cynthia C. S. Liem.

Dr. Cynthia C. S. Liem is an Assistant Professor in the Multimedia Computing Group of Delft University of Technology, The Netherlands, and pianist of the Magma Duo. She initiated and co-coordinated the European research project PHENICX (2013-2016), focusing on technological enrichment of symphonic concert recordings with partners such as the Royal Concertgebouw Orchestra. Her research interests consider music and multimedia search and recommendation, and increasingly shift towards making people discover new interests and content which would not trivially be retrieved. Beyond her academic activities, Cynthia gained industrial experience at Bell Labs Netherlands, Philips Research and Google. She was a recipient of the Lucent Global Science and Google Anita Borg Europe Memorial scholarships, the Google European Doctoral Fellowship 2010 in Multimedia, and a finalist of the New Scientist Science Talent Award 2016 for young scientists committed to public outreach.

Dr. Jochen Huber is a Senior User Experience Researcher at Synaptics. Previously, he was an SUTD-MIT postdoctoral fellow in the Fluid Interfaces Group at MIT Media Lab and the Augmented Human Lab at Singapore University of Technology and Design. He holds a Ph.D. in Computer Science and degrees in both Mathematics (Dipl.-Math.) and Computer Science (Dipl.-Inform.), all from Technische Universität Darmstadt, Germany. Jochen’s work is situated at the intersection of Human-Computer Interaction and Human Augmentation. He designs, implements and studies novel input technology in the areas of mobile, tangible & non-visual interaction, automotive UX and assistive augmentation. He has co-authored over 60 academic publications and regularly serves as program committee member in premier HCI and multimedia conferences. He was program co-chair of ACM TVX 2016 and Augmented Human 2015 and chaired tracks of ACM Multimedia, ACM Creativity and Cognition and ACM International Conference on Interface Surfaces and Spaces, as well as numerous workshops at ACM CHI and IUI. Further information can be found on his personal homepage: http://jochenhuber.com

The Deep Learning Indaba Report

Abstract

Given the increasing focus on deep learning and machine learning, there is a need to address the problem of low participation of Africans in data science and artificial intelligence. The Deep Learning Indaba was thus born to stimulate the participation of Africans within the research and innovation landscape surrounding deep learning and machine learning. This column reports on the Deep Learning Indaba event, which consisted of a 5-day series of introductory lectures on Deep Learning, held from 10-15 September 2017, coupled with tutorial sessions where participants gained practical experience with deep learning software packages. The column also includes interviews with some of the organisers to learn more about the origin and future plans of the Deep Learning Indaba.

Introduction

Africans have low participation in the areas of science called deep learning and machine learning, as shown by the fact that at the 2016 Neural Information Processing Systems (NIPS’16) conference, none of the accepted papers had at least one author from a research institution in Africa (http://www.deeplearningindaba.com/blog/missing-continents-a-study-using-accepted-nips-papers).

Given the increasing focus on deep learning, and the more general area of machine learning, there is a need to address this problem of low participation of Africans in the technology that underlies the recent advances in data science and artificial intelligence that is set to transform the way the world works. The Deep Learning Indaba was thus born, aiming to be a series of master classes on deep learning and machine learning for African researchers and technologists. The purpose of the Deep Learning Indaba was to stimulate the participation of Africans, within the research and innovation landscape surrounding deep learning and machine learning.

What is an ‘indaba’?

According to the organisers ‘indaba’ is a Zulu word that simply means gathering or meeting. There are several words for such meetings (that are held throughout southern Africa) including an imbizo (in Xhosa), an intlanganiso, and a lekgotla (in Sesotho), a baraza (in Kiswahili) in Kenya and Tanzania, and padare (in Shona) in Zimbabwe. Indabas have several functions: to listen and share news of members of the community, to discuss common interests and issues facing the community, and to give advice and coach others. Using the word ‘indaba’ for the Deep Learning event connects it to other community gatherings that are similarly held by cultures throughout the world. The Deep Learning Indaba is about the spirit of coming together, of sharing and learning and is one of the core values of the event.

The Deep Learning Indaba

After a couple of months of furious activity by the organisers, roughly 300 students, researchers and machine learning practitioners from all over Africa gathered for the first Deep Learning Indaba from 10-15 September 2017 at the University of the Witwatersrand, Johannesburg, South Africa. More than 30 African countries were represented for an intense week of immersion into Deep Learning.

The Deep Learning Indaba consisted of a 5-day series of introductory lectures on Deep Learning, coupled with tutorial sessions where participants gained practical experience with deep learning software packages such as TensorFlow. The format of the Deep Learning Indaba was based on the intense summer school experience of NIPS. Presenters at the Indaba included prominent figures in the machine learning community such as Nando de Freitas, Ulrich Paquet and Yann Dauphin. The lecture sessions were all recorded and all the practical tutorials are also available online: Lectures and Tutorials.

After organising the first successful Deep Learning Indaba in Africa (a report on the outcomes of the Deep Learning Indaba can be found online), the organisers have already started planning the next two Deep Learning Indabas, which will take place in 2018 and 2019. More information can be found at the Deep Learning Indaba website http://www.deeplearningindaba.com.

Having been privileged to attend this first Deep Learning Indaba, I interviewed a number of the organisers to learn more about the origin and future plans of the Deep Learning Indaba. The interviewed organisers include Ulrich Paquet and Stephan Gouws.

Question 1: What was the origin of the Deep Learning Indaba?

Ulrich Paquet: We’d have to dig into history a bit here, as the dream of taking ICML (International Conference on Machine Learning) to South Africa has been around for a while. The topic was again raised at the end of 2016, when Shakir and I sat at NIPS (Conference on Neural Information Processing Systems), and said “let’s find a way to make something happen in 2017.” We were waiting for the right opportunity. Stephan has been thinking along these lines, and so has George Konidaris. I met Benjamin Rosman in January or February over e-mail, and within a day we were already strategizing what to do.

We didn’t want to take a big conference to South Africa, as people parachute in and out, without properly investing in education. How can we make the best possible investment in South African machine learning? We thought a summer school would be the best vehicle, but more than that, we wanted a summer school that would replicate the intense NIPS experience in South Africa: networking, parties, high-octane teaching, poster sessions, debates and workshops…

Shakir asked Demis Hassabis for funding in February this year, and Demis was incredibly supportive. And that got the ball rolling…

Stephan Gouws: It began with the question that was whispered amongst many South Africans in the machine learning industry: “how can we bring ICML to South Africa?” Early in 2017, Ulrich Paquet and Shakir Mohamed (both from Google DeepMind) began a discussion regarding how a summer school-like event could be held in South Africa. A summer school-like event was chosen as it typically has a bigger impact after the event than a typical conference. Benjamin Rosman (from the South African Council for Scientific and Industrial Research) and Nando de Freitas (also from Google DeepMind) joined the discussion in February. A fantastic group of researchers from South Africa was gathered who shared the vision of making the event a reality. I suggested the name “Deep Learning Indaba”, we registered a domain, and from there we got the ball rolling!

Question 2: What did the organisers want to achieve with the Indaba?

Ulrich Paquet: Strengthening African Machine Learning

“a shared space to learn, to share, and to debate the state-of-the-art in machine learning and artificial intelligence”

  • Teaching and mentoring
  • Building a strong research community
  • Overcoming isolation

We also wanted to work towards inclusion, build a community, build confidence, and influence government policy.

Stephan Gouws: Our vision is to strengthen machine learning in Africa. Machine learning experts, workshops and conferences are mostly concentrated in North America and Western Europe. Africans do not easily get the opportunity to be exposed to such events, as they are far away, expensive to attend, etc. Furthermore, at a conference a group of experts fly in, discuss the state of the art of the field, and then fly away. A conference does not easily allow for a transfer of expertise, and therefore the local community does not gain much from it. With the Indaba, we hoped to facilitate knowledge transfer (for which a summer school-like event is better suited), and also to create networking opportunities for students, industry, academics and the international presenters.

Question 3: Why was the Indaba held in South Africa?

Ulrich Paquet: All of the (original) organizers are South African, and really care about the development of their own country. We wanted to reach beyond South Africa, though, and tried to include as many institutions as possible (more than 20 African countries were represented).

But, one has to remember that the first Indaba was essentially an experiment. We had to start somewhere! We benefit by having like-minded local organizers :)

Stephan Gouws: All the organisers are originally from South Africa and want to support and strengthen the machine learning field in South Africa (and eventually in the rest of Africa).

Question 4: What were the expectations beforehand for the Indaba? (For example, how many people did the organisers expect would attend?)

Ulrich Paquet: Well, we originally wanted to run a series of master classes for 40 students. We had ABSOLUTELY NO idea how many students would apply, or if any would even apply. We were very surprised when we hit more than 700 applications by our deadline, and by then, the whole game changed. We couldn’t take 40 out of 700, and decided to go for the largest lecture hall we could possibly find (for 300 people).

There are then other logistics of scale that come into play: feeding everyone, transporting everyone, running practical sessions, etc. And it has to be within budget!! The cap at 300 seemed to work well.

Question 5: Are there any plans for the future of the Indaba? Are you planning on making it an annual event?

Ulrich Paquet: Yes, definitely.

Stephan Gouws: Nothing official yet, but the plan from the beginning was to make it an annual event.

[Editor]: The Deep Learning Indaba 2018 has since been announced and more information can be found at the following link: http://www.deeplearningindaba.com/indaba-2018.html. The organisers have also announced locally organised, one-day IndabaX events, to be held from 26 March to 6 April 2018, with the aim of strengthening the African machine learning community. Details on obtaining support for organising an IndabaX event can be found at the main site: http://www.deeplearningindaba.com/indabax

Question 6: How can students, researchers and people from industry still get and stay involved after the Indaba?

Ulrich Paquet: There are many things that could be changed with enough critical mass. One thing we are hoping for is to ensure that the climate for research in sub-Saharan Africa is as fertile as possible. This will only happen through lots of collaboration and cross-pollination. There are some things that stand in the way of this kind of collaboration. One is the government KPIs (key performance indicators) that reward research: for AI, they do not properly reward collaboration, and do not properly reward publication in the top-tier venues, which are all conferences (NIPS, ICML). Therefore, they do not reward playing in and contributing to the most competitive playing field. These are all things that the AI community in SA should seek to creatively address and change.

We have seen organic South African papers published at UAI and ICML for the first time this year, and the next platforms should be JMLR and NIPS, and then Nature. There have never been any organic African AI or machine learning papers in any of the latter venues. Students should be encouraged to collaborate and submit to them! The nature of the game is that the barrier to entry for these venues is so high that one has to collaborate… This of course brings me to my point about why research grants (in SA) should be revisited to reflect these outcomes.

Stephan Gouws: In short, yes. All the practicals, lectures and videos have been made publicly available. There are also Facebook and WhatsApp groups, and we hope that the discussion and networking will not stop after the 15th of September. As a side note: I am working on ideas (more aimed at postgraduate students) to eventually put a mentorship system in place, as well as other types of support for postgraduate students after the Indaba. But it is still early days and only time will tell.

Biographies of Interviewed Organisers

Ulrich Paquet (Research Scientist, DeepMind, London):


Dr. Ulrich Paquet is a Research Scientist at DeepMind, London. He really wanted to be an artist before stumbling onto machine learning while attending a third-year course at the University of Pretoria (South Africa), where he eventually obtained a Master’s degree in Computer Science. In April 2007 Ulrich obtained his PhD from the University of Cambridge with the dissertation “Bayesian Inference for Latent Variable Models.” After obtaining his PhD he worked with a start-up called Imense, focusing on face recognition and image similarity search. He then joined Microsoft’s FUSE Labs, based at Microsoft Research Cambridge, where he eventually worked on the Xbox One launch as part of the Xbox Recommendations team. In 2015 he joined another Cambridge start-up, VocalIQ, which was acquired by Apple, before joining DeepMind in April 2016.

Stephan Gouws (Research Scientist, Google Brain Team):


Dr. Stephan Gouws is a Research Scientist at Google and part of the Google Brain Team that developed TensorFlow and Google’s Neural Machine Translation System. His undergraduate studies were a double major in Electronic Engineering and Computer Science at Stellenbosch University (South Africa). His postgraduate studies in Electronic Engineering were also completed at the MIH Media Lab at Stellenbosch University. He obtained his Master’s degree cum laude in 2010 and his PhD in 2015 with the dissertation “Training Neural Word Embeddings for Transfer Learning and Translation.” During his PhD he spent one year at the Information Sciences Institute (ISI) at the University of Southern California in Los Angeles and one year at the Montreal Institute for Learning Algorithms, where he worked closely with Yoshua Bengio. He also worked as a research intern at both Microsoft Research and Google Brain during this period.

 
The Deep Learning Indaba Organisers:

Shakir Mohamed (Research Scientist, DeepMind, London)
Nyalleng Moorosi (Researcher, Council for Scientific and Industrial Research, South Africa)
Ulrich Paquet (Research Scientist, DeepMind, London)
Stephan Gouws (Research Scientist, Google Brain Team, London)
Vukosi Marivate (Researcher, Council for Scientific and Industrial Research, South Africa)
Willie Brink (Senior Lecturer, Stellenbosch University, South Africa)
Benjamin Rosman (Researcher, Council for Scientific and Industrial Research, South Africa)
Richard Klein (Associate Lecturer, University of the Witwatersrand, South Africa)

Advisory Committee:

Nando de Freitas (Research Scientist, DeepMind, London)
Ben Herbst (Professor, Stellenbosch University)
Bonolo Mathibela (Research Scientist, IBM Research South Africa)
George Konidaris (Assistant Professor, Brown University)
Bubacarr Bah (Research Chair, African Institute for Mathematical Sciences, South Africa)

MPEG Column: 120th MPEG Meeting in Macau, China

The original blog post can be found at the Bitmovin Techblog and has been updated here to focus on and highlight research aspects.

MPEG Plenary Meeting


The MPEG press release comprises the following topics:

  • Point Cloud Compression – MPEG evaluates responses to call for proposal and kicks off its technical work
  • The omnidirectional media format (OMAF) has reached its final milestone
  • MPEG-G standards reach Committee Draft for compression and transport technologies of genomic data
  • Beyond HEVC – The MPEG & VCEG call to set the next standard in video compression
  • MPEG adds better support for mobile environment to MMT
  • New standard completed for Internet Video Coding
  • Evidence of new video transcoding technology using side streams

Point Cloud Compression

At its 120th meeting, MPEG analysed the technologies submitted by nine industry leaders as responses to the Call for Proposals (CfP) for Point Cloud Compression (PCC). These technologies address the lossless or lossy coding of 3D point clouds with associated attributes such as colour and material properties. Point clouds are unordered sets of points in a 3D space, typically captured using various setups of multiple cameras, depth sensors, LiDAR scanners, etc., but they can also be generated synthetically, and they are already in use in several industries. They have recently emerged as representations of the real world enabling immersive forms of interaction, navigation, and communication. Point clouds are typically represented by extremely large amounts of data, which poses a significant barrier for mass-market applications. Thus, MPEG has issued a Call for Proposals seeking technologies that allow the reduction of point cloud data for its intended applications. After a formal objective and subjective evaluation campaign, MPEG selected three technologies as starting points for the test models for static, animated, and dynamically acquired point clouds. A key conclusion of the evaluation was that state-of-the-art point cloud compression can be significantly improved by leveraging decades of 2D video coding tools and combining 2D and 3D compression technologies. Such an approach provides synergies with existing hardware and software infrastructures for rapid deployment of new immersive experiences.

Although the initial selection of technologies for point cloud compression was concluded at the 120th MPEG meeting, it can also be seen as a kick-off for the scientific evaluation and further development of these technologies, including their optimization. It is expected that various scientific conferences will focus on point cloud compression and may issue calls for grand challenges, as for example at IEEE ICME 2018.

Omnidirectional Media Format (OMAF)

The understanding of the potential of virtual reality (VR) is growing, but market fragmentation caused by the lack of interoperable formats for the storage and delivery of such content stifles VR’s market potential. MPEG’s recently started project, referred to as Omnidirectional Media Format (OMAF), reached Final Draft International Standard (FDIS) status at the 120th meeting. It includes

  • equirectangular projection and cubemap projection as projection formats;
  • signalling of metadata required for interoperable rendering of 360-degree monoscopic and stereoscopic audio-visual data; and
  • a selection of audio-visual codecs for this application.

It also includes technologies to arrange video pixel data in numerous ways to improve compression efficiency and reduce the size of video, a major bottleneck for VR applications and services. The standard also includes technologies for the delivery of OMAF content with MPEG-DASH and MMT.

MPEG has defined a format comprising a minimal set of tools to enable interoperability among implementers of the standard. Various aspects are deliberately excluded from the normative parts to foster innovation leading to novel products and services. This enables researchers and practitioners to experiment with these new formats in various ways and to focus on informative aspects, where competition can typically be found. For example, efficient means for encoding and packaging omnidirectional/360-degree media content and its adaptive streaming, including support for (ultra-)low latency, will become a big issue in the near future.

MPEG-G: Compression and Transport Technologies of Genomic Data

The availability of high throughput DNA sequencing technologies opens new perspectives in the treatment of several diseases making possible the introduction of new global approaches in public health known as “precision medicine”. While routine DNA sequencing in the doctor’s office is still not current practice, medical centers have begun to use sequencing to identify cancer and other diseases and to find effective treatments. As DNA sequencing technologies produce extremely large amounts of data and related information, the ICT costs of storage, transmission, and processing are also very high. The MPEG-G standard addresses and solves the problem of efficient and economical handling of genomic data by providing new

  • compression technologies (ISO/IEC 23092-2) and
  • transport technologies (ISO/IEC 23092-1),

which reached Committee Draft level at its 120th meeting.

Additionally, the Committee Drafts for

  • metadata and APIs (ISO/IEC 23092-3) and
  • reference software (ISO/IEC 23092-4)

are scheduled for the next MPEG meeting and the goal is to publish Draft International Standards (DIS) at the end of 2018.

This new type of (media) content, which requires compression and transport technologies, is emerging within the multimedia community at large and, thus, input is welcome.

Beyond HEVC – The MPEG & VCEG Call to set the Next Standard in Video Compression

The 120th MPEG meeting marked the first major step toward the next generation of video coding standard in the form of a joint Call for Proposals (CfP) with ITU-T SG16’s VCEG. After two years of collaborative informal exploration studies and a gathering of evidence that successfully concluded at the 118th MPEG meeting, MPEG and ITU-T SG16 agreed to issue the CfP for future video coding technology with compression capabilities that significantly exceed those of the HEVC standard and its current extensions. They also formalized an agreement on the formation of a joint collaborative team called the “Joint Video Experts Team” (JVET) to work on the development of the new planned standard, pending the outcome of the CfP that will be evaluated at the 122nd MPEG meeting in April 2018. To evaluate the proposed compression technologies, formal subjective tests will be performed using video material submitted by proponents in February 2018. The CfP includes the testing of technology for 360° omnidirectional video coding and the coding of content with high dynamic range and wide colour gamut in addition to conventional standard-dynamic-range camera content. Anticipating a strong response to the call, a “test model” draft design is expected to be selected in 2018, with a potential new standard expected to be completed in late 2020.

The major goal of a new video coding standard is to be better than its predecessor (HEVC). Typically, this “better” is quantified as 50%, which means that it should be possible to encode video at the same quality with half the bitrate, or at a significantly higher quality with the same bitrate. However, at this time the “Joint Video Experts Team” (JVET) of MPEG and ITU-T SG16 faces competition from the Alliance for Open Media, which is working on AV1. In any case, we are looking forward to an exciting period from now until this new codec is ratified, and to seeing how it will perform compared to AV1. Multimedia systems and applications will also benefit from the new codec, which will gain traction as soon as its first implementations become available (note that AV1 is already available as open source and is being continuously developed further).

MPEG adds Better Support for Mobile Environment to MPEG Media Transport (MMT)

MPEG has approved the Final Draft Amendment (FDAM) to MPEG Media Transport (MMT; ISO/IEC 23008-1:2017), which is referred to as “MMT enhancements for mobile environments”. In order to reflect industry needs regarding MMT, which has been well adopted by broadcast standards such as ATSC 3.0 and Super Hi-Vision, it addresses several important issues concerning the efficient use of MMT in mobile environments. For example, it adds a distributed resource identification message to facilitate multipath delivery and a transition request message to change the delivery path of an active session. This amendment also introduces the concept of an MMT-aware network entity (MANE), which might be placed between the original server and the client, and provides a detailed description of how to use it to both improve efficiency and reduce delivery delay. Additionally, this amendment provides a method to use WebSockets to set up and control an MMT session/presentation.

New Standard Completed for Internet Video Coding

A new standard for video coding suitable for the internet as well as other video applications was completed at the 120th MPEG meeting. The Internet Video Coding (IVC) standard was developed with the intention of providing the industry with an “Option 1” video coding standard. In ISO/IEC language, this refers to a standard for which patent holders have declared a willingness to grant licenses free of charge to an unrestricted number of applicants for all necessary patents on a worldwide, non-discriminatory basis and under other reasonable terms and conditions, to enable others to make, use, and sell implementations of the standard. At the time of completion of the IVC standard, the specification contained no identified necessary patent rights except those available under Option 1 licensing terms. During the development of IVC, MPEG removed from the draft standard any necessary patent rights that it was informed were not available under such Option 1 terms, and MPEG is optimistic about the outlook for the new standard. MPEG encourages interested parties to provide information about any other similar cases. The IVC standard has roughly similar compression capability as the earlier AVC standard, which has become the most widely deployed video coding technology in the world. Tests have been conducted to verify IVC’s strong technical capability, and the new standard has also been shown to have relatively modest implementation complexity requirements.

Evidence of new Video Transcoding Technology using Side Streams

Following a “Call for Evidence” (CfE) issued by MPEG in July 2017, evidence was evaluated at the 120th MPEG meeting to investigate whether video transcoding technology has been developed for transcoding assisted by side data streams that is capable of significantly reducing the computational complexity without reducing compression efficiency. The evaluations of the four responses received included comparisons of the technology against adaptive bit-rate streaming using simulcast as well as against traditional transcoding using full video re-encoding. The responses span the compression efficiency space between simulcast and full transcoding, with trade-offs between the bit rate required for distribution within the network and the bit rate required for delivery to the user. All four responses provided a substantial computational complexity reduction compared to transcoding using full re-encoding. MPEG plans to further investigate transcoding technology and is soliciting expressions of interest from industry on the need for standardization of such assisted transcoding using side data streams.

MPEG currently works on two related topics, referred to as network-distributed video coding (NDVC) and network-based media processing (NBMP). Both activities involve the network, which is increasingly evolving into a highly distributed compute and delivery platform, as opposed to a bit pipe that simply delivers data as fast as possible from A to B. This phenomenon is also interesting when looking at developments around 5G, which is much more than just a radio access technology. These activities are certainly worth monitoring, as they contribute to making networked media resources accessible or even programmable. In this context, I would like to refer the interested reader to the December ’17 theme of the IEEE Computer Society Computing Now, which is about Advancing Multimedia Content Distribution.


Publicly available documents from the 120th MPEG meeting can be found here (scroll down to the end of the page). The next MPEG meeting will be held in Gwangju, Korea, January 22-26, 2018. Feel free to contact Christian Timmerer for any questions or comments.


Some of the activities reported above are considered within the Call for Papers of the 23rd Packet Video Workshop (PV 2018), co-located with ACM MMSys 2018 in Amsterdam, The Netherlands. Topics of interest include (but are not limited to):

  • Adaptive media streaming, and content storage, distribution and delivery
  • Network-distributed video coding and network-based media processing
  • Next-generation/future video coding, point cloud compression
  • Audiovisual communication, surveillance and healthcare systems
  • Wireless, mobile, IoT, and embedded systems for multimedia applications
  • Future media internetworking: information-centric networking and 5G
  • Immersive media: virtual reality (VR), augmented reality (AR), 360° video and multi-sensory systems, and their streaming
  • Machine learning in media coding and streaming systems
  • Standardization: DASH, MMT, CMAF, OMAF, MiAF, WebRTC, MSE, EME, WebVR, Hybrid Media, WAVE, etc.
  • Applications: social media, game streaming, personal broadcast, healthcare, industry 4.0, education, transportation, etc.

Important dates

  • Submission deadline: March 1, 2018
  • Acceptance notification: April 9, 2018
  • Camera-ready deadline: April 19, 2018

JPEG Column: 77th JPEG Meeting in Macau, China


JPEG XS is now entering the final phase of its definition and will soon be available. It is important to highlight the change from the typical JPEG approach, as this is the first JPEG image compression standard that is not developed solely to target the best compression performance for the best perceptual quality. Instead, JPEG XS establishes a compromise between compression efficiency and low complexity. This new approach is complemented by the development of a new part of the well-established JPEG 2000, named High Throughput JPEG 2000.

With these initiatives, the JPEG committee is standardizing low-complexity and low-latency codecs, with a slight sacrifice of the compression performance usually sought in previous standards. This change of paradigm is justified by current trends in multimedia technology, with the continuous growth of devices that are highly dependent on battery life, namely mobiles, tablets, and also augmented reality devices or autonomous robots. Furthermore, these standards provide support for applications like omnidirectional video capture or real-time video storage and streaming. Nowadays, networks tend to grow in available bandwidth, and the memory available in most devices has also been reaching impressive numbers. Although compression is still required to simplify the manipulation of large amounts of data, its performance might become secondary if kept at acceptable levels. Obviously, considering the advances in coding technology over the last 25 years, these new approaches define codecs with compression performance well above the JPEG standard used in most devices today. Moreover, they provide enhanced capabilities like HDR support, lossless or near-lossless modes, and alpha plane coding.

At the 77th JPEG meeting, held in Macau, China, from 21 to 27 October, several activities were considered, as briefly described in the following.


  1. A call for proposals on JPEG 360 Metadata for the current JPEG family of standards has been issued.
  2. New advances on low complexity/low latency compression standards, namely JPEG XS and High Throughput JPEG 2000.
  3. Continuation of JPEG Pleno project that will lead to a family of standards on different 3D technologies, like light fields, digital holography and also point clouds data.
  4. New CfP for the Next-Generation Image Compression Standard.
  5. Definition of a JPEG reference software.

Moreover, a celebration of the 25th JPEG anniversary took place, at which early JPEG committee members from Asia were honoured.

The different activities are described in the following paragraphs.

 

JPEG Privacy and Security

JPEG Privacy & Security is a work item (ISO/IEC 19566-4) aiming at developing a standard that provides technical solutions which can ensure privacy, maintain data integrity and protect intellectual property rights (IPR). A Call for Proposals was published in April 2017 and, based on a descriptive analysis of the submitted solutions for supporting protection and authenticity features in JPEG files, a working draft of JPEG Privacy & Security in the context of JPEG Systems standardization was produced during the 77th JPEG meeting in Macau, China. To collect further comments from the stakeholders in this field, an open online meeting on JPEG Privacy & Security will be conducted before the 78th JPEG meeting in Rio de Janeiro, Brazil, on Jan. 27-Feb. 2, 2018. The JPEG Committee invites interested parties to the meeting. Details will be announced in the JPEG Privacy & Security AhG email reflector.

 

JPEG 360 Metadata

The JPEG Committee has issued a “Draft Call for Proposals (CfP) on JPEG 360 Metadata” at the 77th JPEG meeting in Macau, China. The JPEG Committee notes the increasing use of multi-sensor images from multiple image sensor devices, such as 360-degree capturing cameras or dual-camera smartphones available to consumers. Images from these cameras are shown on computers, smartphones and Head Mounted Displays (HMDs). JPEG standards are commonly used for image compression and as the file format to store and share such content. However, because existing JPEG standards do not fully cover all new uses, incompatibilities have reduced the interoperability of 360-degree images, and thus the widespread ubiquity that consumers have come to expect when using JPEG-based images. Additionally, new modalities for interaction with images, such as computer-based augmentation, face-tagging, and object classification, require support for metadata that was not part of the scope of the original JPEG. To avoid fragmentation in the market and to ensure interoperability, a standard way of interacting with multi-sensor images and richer metadata is desired in JPEG standards. This CfP invites all interested parties, including manufacturers, vendors and users of such devices, to submit technology proposals for enabling interactions with multi-sensor images and metadata that fulfill the scope, objectives and requirements.

 

High Throughput JPEG 2000

The JPEG Committee is continuing its work towards the creation of a new Part 15 to the JPEG 2000 suite of standards, known as High Throughput JPEG 2000 (HTJ2K).

Since the release of an initial Call for Proposals (CfP) at the outcome of its 76th meeting, the JPEG Committee has completed the software test bench that will be used to evaluate technology submissions, and has reviewed initial registrations of intent. Final technology submissions are due on 1 March 2018.

The HTJ2K activity aims to develop an alternate block-coding algorithm that can be used in place of the existing block coding algorithm specified in ISO/IEC 15444-1 (JPEG 2000 Part 1). The objective is to significantly increase the throughput of JPEG 2000, at the expense of a small reduction in coding efficiency, while allowing mathematically lossless transcoding to and from codestreams using the existing block coding algorithm.

 

JPEG XS

This project aims at the standardization of a visually lossless low-latency lightweight compression scheme that can be used as a mezzanine codec for the broadcast industry, Pro-AV and other markets. Targeted use cases are professional video links, IP transport, Ethernet transport, real-time video storage, video memory buffers, and omnidirectional video capture and rendering. After four rounds of Core Experiments, the Core Coding System has now been finalized and the ballot process has been initiated.

Additional parts of the Standard are still being specified, in particular future profiles, as well as transport and container formats. The JPEG Committee therefore invites interested parties – in particular coding experts, codec providers, system integrators and potential users of the foreseen solutions – to contribute to the further specification process. Publication of the International Standard is expected for Q3 2018.

 

JPEG Pleno

This standardization effort is targeting the creation of a multimodal framework for the exchange of light field, point cloud, depth+texture and holographic data in end-to-end application chains. Currently, the JPEG Committee is defining the coding framework of the light field modality, for which the signalling syntax will be specified in Part 2 of the JPEG Pleno standard. In parallel, JPEG is reaching out to companies and research institutes that are active in the point cloud and holography arena and invites them to contribute to the standardization effort. JPEG is seeking additional input, both in terms of test data and quality assessment methodologies for these specific image modalities, and in terms of technology that supports their generation, reconstruction and/or rendering.

 

JPEG XL

The JPEG Committee has launched a Next-Generation Image Compression Standardization activity, also referred to as JPEG XL. This activity aims to develop a standard for image compression that offers substantially better compression efficiency than existing image formats (e.g. >60% over JPEG-1), along with features desirable for web distribution and efficient compression of high-quality images.

The JPEG Committee intends to issue a final Call for Proposals (CfP) following its 78th meeting (January 2018), with the objective of seeking technologies that fulfill the objectives and scope of the Next-Generation Image Compression Standardization activity.

A draft Call for Proposals, with all related information, has been issued and can be found on the JPEG website. Comments are welcome and should be submitted as specified in the document.

To stay posted on the action plan for JPEG XL, please regularly consult our website at jpeg.org and/or subscribe to our e-mail reflector. You will receive information to confirm your subscription and, upon acceptance by the moderator, will be included in the mailing list.

 

JPEG Reference Software

Along with its celebration of the 25th anniversary of the commonly known JPEG still image compression specification, the JPEG Committee has launched an activity to fill a long-known gap in this important image coding standard, namely the definition of a JPEG reference software. At its 77th meeting, the JPEG Committee collected submissions for a reference software, evaluated them for suitability, and has now started the standardization process for such software on the basis of the submissions received.



JPEG 25th anniversary of the first JPEG standard

The JPEG Committee held a 25th anniversary celebration of its first standard in Macau, specifically organized to honour past committee members from Asia, and was proud to award Takao Omachi for his contributions to the first JPEG standard, Fumitaka Ono for his long-lasting contributions to the JBIG and JPEG standards, and Daniel Lee for his contributions to the JPEG family of standards and long-lasting service as Convenor of the JPEG Committee. The celebrations of the anniversary of this successful standard, which is still growing in use after 25 years, will have a third and final event during the 79th JPEG meeting planned in La Jolla, CA, USA.


 

Final Quote

“JPEG is committed to design of specifications that ensure privacy and other security and protection solutions across the entire JPEG family of standards” said Prof. Touradj Ebrahimi, the Convener of the JPEG committee. 

 

About JPEG

The Joint Photographic Experts Group (JPEG) is a Working Group of ISO/IEC, the International Organisation for Standardization / International Electrotechnical Commission, (ISO/IEC JTC 1/SC 29/WG 1) and of the International Telecommunication Union (ITU-T SG16), responsible for the popular JBIG, JPEG, JPEG 2000, JPEG XR, JPSearch and more recently, the JPEG XT, JPEG XS, JPEG Systems and JPEG Pleno families of imaging standards.

The JPEG group meets nominally three times a year, in Europe, North America and Asia. The latest, 77th meeting was held on October 21-27, 2017, in Macau, China, and the next, 78th JPEG meeting will be held on January 27 to February 2, 2018, in Rio de Janeiro, Brazil.

More information about JPEG and its work is available at www.jpeg.org or by contacting Antonio Pinheiro and Frederik Temmermans of the JPEG Communication Subgroup at pr@jpeg.org.

If you would like to stay posted on JPEG activities, please subscribe to the jpeg-news mailing list on https://listserv.uni-stuttgart.de/mailman/listinfo/jpeg-news. Moreover, you can follow the JPEG Twitter account at http://twitter.com/WG1JPEG.

Future JPEG meetings are planned as follows:

  • No 78, Rio de Janeiro, Brazil, January 27 to February 2, 2018
  • No 79, La Jolla (San Diego), CA, USA, April 9 to 15, 2018
  • No 80, Berlin, Germany, July 7 to 13, 2018

How Do Ideas Flow around SIGMM Conferences?


The ACM Multimedia conference just celebrated its quarter century in October 2017. This is a great opportunity to reflect on the intellectual influence of the conference, and the SIGMM community in general.

The progress on big scholarly data allows us to approach this task analytically. I downloaded a data dump from Microsoft Academic Graph (MAG) in February 2016 and found all papers from ACM Multimedia (MM), the SIGMM flagship conference: there are 4,346 publication entries from 1993 to 2015. I then searched the entire MAG for: (1) any paper that appears in the reference list of these MM papers, yielding 35,829 entries across 1,560 publication venues (including both journals and conferences), an average of 8.24 references per paper; and (2) any paper that cites any of these MM papers, yielding 46,826 citations from 1,694 publication venues, an average of 10.77 citations per paper.
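For readers who want to build a similar profile on their own data, the aggregation step is straightforward. The sketch below is a minimal illustration rather than the released analysis code [2]; it assumes two hypothetical CSV exports from the MAG dump (mm_references.csv and mm_citations.csv), each with one row per citation edge and the venue of the non-MM endpoint in a column named venue.

```python
import pandas as pd

# Hypothetical exports from the MAG dump: one row per citation edge,
# with the venue of the non-MM endpoint in a "venue" column.
refs = pd.read_csv("mm_references.csv")    # papers that MM papers cite (incoming ideas)
cites = pd.read_csv("mm_citations.csv")    # papers that cite MM papers (outgoing ideas)

# Count edges per venue in either direction.
incoming = refs["venue"].value_counts().rename("references")
outgoing = cites["venue"].value_counts().rename("citations")

flow = pd.concat([incoming, outgoing], axis=1).fillna(0)
flow["total"] = flow["references"] + flow["citations"]
# Ratio of outgoing citations to total traffic, used to order venues
# left-to-right in the "citation flower".
flow["out_ratio"] = flow["citations"] / flow["total"]

print(flow.sort_values("total", ascending=False).head(25))
```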

This data allows us to profile the incoming (references) and outgoing (citations) influence in the community in detail. In this article, we highlight two questions below.

Where are the intellectual influences of the SIGMM community coming from, and going to?

If you have been publishing in and going to SIGMM conference(s) for a while, you may wonder where the ideas presented today will have their influence after 5, 10, or 20 years. You may also wonder which ideas cross over to other fields and disciplines, and which stay and flourish within the SIGMM community. And you may wonder whether the flow of influence has changed since you entered the community, 3, 5, 10, or 20+ years ago.

If you are new to SIGMM, you may wonder what this community’s intellectual heritage is. For new students or researchers who recently entered this area, you may wonder in which other publication venues you are likely to find work relevant to multimedia.

Figure 1. The citation flow for ACM Multimedia (1993-2015). Summary of incoming vs outgoing citations to the top 25 venues in either direction. Node colors: ratio of citations (outgoing ideas, red) vs references (incoming ideas, blue). Node sizes: amount of total citation+references in either direction. Thickness of blue edges are scaled by the number of references going to a given venue; thickness of red edges are scaled by the number of citations coming from a given venue. Nodes are sorted left-to-right by the ratio of incoming vs outgoing citations to this conference.


A summary of this information is found in the “citation flower” graph above, summarising the incoming and outgoing influence since the inception of ACM MM (1993-2015).

On the right of the “citation flower” we can see venues that have had more influence on MM than the other way around; these include computer vision and pattern recognition (CVPR, ICCV, ECCV, T-PAMI, IJCV), machine learning (NIPS, JMLR, ICML), networking and systems (INFOCOM), information retrieval (SIGIR), human-computer interaction (CHI), as well as related journals (IEEE Multimedia). The diversity of incoming influence is part of the SIGMM identity, as the community has always been a place where ideas from disparate areas meet and generate interesting solutions to problems, as well as new challenges. As indicated by the breakdown over time (on a separate page), the incoming influence of CVPR is increasing, and that of IEEE Trans. Circuits and Systems for Video Technology is decreasing; this is consistent with video coding technology having matured over the last two decades and computer vision currently being fast-evolving.

On the left of the “citation flower”, we can see that ACM MM has been a major influencer for a variety of multimedia venues, from conferences (ICME, MIR, ICMR, CIVR) to journals (Multimedia Tools and Applications, IEEE Trans. Multimedia), to journals in related areas (IEEE Trans. on Knowledge and Data Engineering).

How many papers are remembered in the collective memory of the academic community and for how long?

Or, as a heated post-conference beer conversation may put it: are 80% of the papers forgotten within 2 years? Spoiler alert: no, for most conferences we looked at; but about 20% tend not to be cited at all.

Figure 2. Fraction of ACM MM papers that are cited at least once more than X years after they are published, with a linear regression overlay.


In Figure 2, we see a typical linear decline in the fraction of papers being cited. For example, 53% of papers have at least one citation after having been published for 10 years. There are multiple factors that affect the shape of this citation survival graph, such as the size of the research community, the turnover rate of ideas (fast-moving or slow-moving), the perceived quality of publications, and others. See here for a number of survival curves of different research communities.
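To make the computation behind Figure 2 concrete, here is a minimal sketch of how such a survival curve can be estimated. It uses toy data rather than the actual MAG dump; the publication and citation years are placeholders, and the regression simply overlays a straight line as in the figure.

```python
import numpy as np

# Toy example data (not the real MAG dump): publication year of each paper and
# the years in which it was cited.
pub_year   = {"p1": 2000, "p2": 2005, "p3": 2010, "p4": 2012}
cite_years = {"p1": [2003, 2012], "p2": [2006], "p3": [], "p4": [2014, 2015]}

data_end = 2015          # last year covered by the data dump
max_age = 10
ages, fractions = [], []
for age in range(1, max_age + 1):
    # Only papers published at least `age` years before the end of the data
    # could possibly be cited that long after publication.
    eligible = [p for p, y in pub_year.items() if data_end - y >= age]
    if not eligible:
        break
    cited = [p for p in eligible
             if any(cy - pub_year[p] >= age for cy in cite_years.get(p, []))]
    ages.append(age)
    fractions.append(len(cited) / len(eligible))

# Linear regression overlay, as in Figure 2.
slope, intercept = np.polyfit(ages, fractions, 1)
print(f"fraction cited >= {max_age} years out: {fractions[-1]:.2f}")
print(f"linear fit: fraction ~ {intercept:.2f} {slope:+.3f} * years")
```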

What about the newer, and more specialised SIGMM conferences?

Figure 3 and Figure 4 show the citation flowers for ICMR and MMSys; both conferences have five years of publication data in MAG. We can see that both conferences are well embedded among the SIGMM and related venues (ACM Multimedia, IEEE Trans. Multimedia), and both receive strong influence from the computer vision community, including T-PAMI, CVPR and ICCV. The sub-community-specific influences come from WWW, ICML and NIPS for ICMR, and from INFOCOM, SIGMETRICS and SIGMAR for MMSys. In terms of outgoing influence, MMSys influences venues in networking (ICC, CoNEXT), and ICMR influences Information Science and MMSys.

Figure 3. The citation flow for ICMR (2011-2015). See Figure 1 caption for the meaning of node/edge colors and sizes.


Figure 4. The citation flow for MMSys (2011-2015). See Figure 1 caption for the meaning of node/edge colors and sizes.


Overall, this case study shows the truly multidisciplinary nature of SIGMM; the community should continue the tradition of fusing ideas and strive to increase its influence in other communities.

I hope you find these analyses and observations somewhat useful, and I would love to hear comments and suggestions from the community. Of course, the data is not perfect, and there is a lot more to do. The project overview page [1] contains details about data processing and several known issues; the software for this analysis and visualisation is also released publicly [2].

Acknowledgements

I thank Alan Smeaton and Pablo Cesar for encouraging this post and many helpful editing suggestions. I also thank Microsoft Academic for making data available.

References

[1] Visualizing Citation Patterns of Computer Science Conferences, Lexing Xie, Aug 2016,  http://cm.cecs.anu.edu.au/post/citation_vis/

[2] Repository for analyzing citation flow https://github.com/lexingxie/academic-graph

 

Practical Guide to Using the YFCC100M and MMCOMMONS on a Budget

 

The Yahoo-Flickr Creative Commons 100 Million (YFCC100M), the largest freely usable multimedia dataset to have been released so far, is widely used by students, researchers and engineers on topics in multimedia that range from computer vision to machine learning. However, its sheer volume, one of the traits that make the dataset unique and valuable, can pose a barrier to those who do not have access to powerful computing resources. In this article, we introduce useful information and tools to boost the usability and accessibility of the YFCC100M, including the supplemental material provided by the Multimedia Commons (MMCOMMONS) community. In particular, we provide a practical guide on how to set up a feasible and cost effective research and development environment locally or in the cloud that can access the data without having to download it first.

YFCC100M: The Largest Multimodal Public Multimedia Dataset

Datasets are unarguably one of the most important components of multimedia research. In recent years there has been a growing demand for a dataset that is not specifically biased or targeted towards certain topics, is sufficiently large, truly multimodal, and freely usable without licensing issues.

The YFCC100M dataset was created to meet these needs and overcome many of the issues affecting existing multimedia datasets. It is, so far, the largest publicly and freely available multimedia collection of metadata representing about 99.2 million photos and 0.8 million videos, all of which were uploaded to Flickr between 2004 and 2014. Metadata included in the dataset are, for example, title, description, tags, geo-tag, uploader information, capture device information, URL to the original item. Additional information was later released in the form of expansion packs to supplement the dataset, namely autotags (presence of visual concepts, such as people, animals, objects, events, architecture, and scenery), Exif metadata, and human-readable place labels. All items in the dataset were published under one of the Creative Commons commercial or noncommercial licenses, whereby approximately 31.8% of the dataset is marked for commercial use and 17.3% has the most liberal license that only requires attribution to the photographer. For academic purposes, the entire dataset can be used freely, which enables fair comparisons and reproducibility of published research works.

Two articles from the people who created the dataset, “YFCC100M: The New Data in Multimedia Research” and “Ins and Outs of the YFCC100M”, give more detail about the motivation, collection process, and interesting characteristics and statistics of the dataset. Since its initial release in 2014, the YFCC100M quickly gained popularity and is widely used in the research community. As of September 2017, the dataset had been requested over 1400 times and cited over 300 times in research publications, on topics in multimedia ranging from computer vision to machine learning. Specific topics include, but are not limited to, image and video search, tag prediction, captioning, learning word embeddings, travel routing, event detection, and geolocation prediction. Demos that use the YFCC100M can be found here.

Figure 1. Overview diagram of YFCC100M and Multimedia Commons.



MMCOMMONS: Making YFCC100M More Useful and Accessible

Out of the many things that the YFCC100M offers, its sheer volume is what makes it especially valuable, but it is also what makes the dataset not so trivial to work with. The metadata alone spans 100 million lines of text and is 45GB in size, not including the expansion packs. To work with the images and/or videos of the YFCC100M, they need to be downloaded first using the individual URLs contained in the metadata. Aside from the time required to download all 100 million items, which would occupy a further 18TB of disk space, the main problem is that a growing number of images and videos is becoming unavailable due to the natural lifecycle of digital items, whereby people occasionally delete what they have shared online. In addition, the time alone to process and analyze the images and videos is generally prohibitive for students and scientists in small research groups who do not have access to high-performance computing resources.

These issues were noted upon the creation of the dataset, and the MMCOMMONS community was formed to coordinate efforts for making the YFCC100M more useful and accessible to all, and to persist the contents of the dataset over time. To that end, MMCOMMONS provides an online repository that holds supplemental material to the dataset, which can be mounted and used to directly process the dataset in the cloud. The images and videos included in the YFCC100M can be accessed and even downloaded freely from an AWS S3 bucket, which was made possible courtesy of the Amazon Public Dataset program. Note that a tiny percentage of images and videos are missing from the bucket, as they had already disappeared when the organizers started the download process right after the YFCC100M was published. This notwithstanding, the images and videos hosted in the bucket still serve as a useful snapshot that researchers can use to ensure proper reproduction of and comparison with their work. Also included in the Multimedia Commons repository are visual and aural features extracted from the image and video content. The MMCOMMONS website provides a detailed description of conventional features and deep features, which include HybridNet, VGG and VLAD. These CNN features can be a good starting point for those who would like to jump right into using the dataset for their research or application.
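As a minimal sketch of what accessing the bucket looks like, the snippet below lists a few objects anonymously with boto3; no AWS credentials are needed for a public bucket. The bucket name and key prefix are assumptions on our part and should be checked against the MMCOMMONS documentation before relying on them.

```python
import boto3
from botocore import UNSIGNED
from botocore.client import Config

# Anonymous (unsigned) access is sufficient for a public bucket.
s3 = boto3.client("s3", region_name="us-west-2",
                  config=Config(signature_version=UNSIGNED))

# Bucket name and key prefix are assumptions; verify them against the
# MMCOMMONS documentation for the current layout.
bucket = "multimedia-commons"
resp = s3.list_objects_v2(Bucket=bucket, Prefix="data/images/", MaxKeys=5)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# A single item can then be fetched with, e.g.:
# s3.download_file(bucket, "data/images/<prefix>/<hash>.jpg", "example.jpg")
```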

The Multimedia Commons has been supporting multimedia researchers by generating annotations (see the YLI Media Event Detection and MediaEval Placing tasks), developing tools, as well as organizing competitions and workshops for ideas exchange and collaboration.

Setting up a Research Environment for YFCC100M and MMCOMMONS

Even with pre-extracted features available, doing meaningful research still requires a lot of computing power to process the large amount of YFCC100M and MMCOMMONS data. We would like to lower the barrier of entry for students and scientists who do not have access to dedicated high-performance resources. In the following we describe how Apache MXNet, Amazon EC2 Spot Instances and AWS S3 can be used to set up a cost-efficient research and development environment for handling the large collection, along with other ways to work with the data more efficiently.

1) Use a subset of the dataset

It is not necessary to work with the entire dataset just because you can. Depending on the use case, it may make more sense to use a well-chosen subset. For instance, the YLI-GEO and YLI-MED subsets released by the MMCOMMONS can be useful for geolocation and multimedia event detection tasks, respectively. For other needs, the data can be filtered to generate a customized subset.

The YFCC100M Dataset Browser is a web-based tool you can use to search the dataset by keyword. It provides an interactive visualization with statistics that helps to better understand the search results. You can generate a list file (.csv) of the items that match the search query, which you can then use to fetch the images and/or videos afterwards. The limitations of this browser are that it only supports keyword search on the tags and that it only accepts ASCII text as valid input, as opposed to UNICODE for queries using non-Roman characters. Also, queries can take up to a few seconds to return results.
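Once you have such a list file, fetching the corresponding images is a matter of iterating over the rows and downloading each URL. The sketch below assumes the export is named browser_export.csv and that the photo URL sits in a column called photo_url; both are placeholders to adjust to what the browser actually produces.

```python
import csv
import os
import requests

# Hypothetical export from the YFCC100M Dataset Browser; the file name and the
# "photo_url" column are placeholders for whatever your export contains.
os.makedirs("images", exist_ok=True)
with open("browser_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        url = row["photo_url"]
        out = os.path.join("images", os.path.basename(url))
        if os.path.exists(out):
            continue                      # already fetched earlier
        r = requests.get(url, timeout=30)
        if r.status_code == 200:          # some items have been deleted since 2014
            with open(out, "wb") as img:
                img.write(r.content)
```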

A more flexible, lower-latency way to search the collection is to set up your own Apache Solr server and index (a subset of) the metadata. For instance, the autotags metadata can be indexed to search for images that contain visual concepts of interest. A step-by-step guide to setting up a Solr server environment with the dataset can be found here. You can write Solr queries in most programming languages by using one of the Solr wrappers.
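Once the metadata is indexed, a query boils down to a single HTTP request against Solr’s standard select handler; the sketch below uses the requests library rather than a dedicated Solr wrapper. The core name (yfcc100m) and field names (autotags, photoid, downloadurl) are placeholders for whatever schema you chose while indexing.

```python
import requests

# Query a local Solr core for items whose autotags mention the "dog" concept.
# Core and field names are assumptions; replace them with your own schema.
resp = requests.get(
    "http://localhost:8983/solr/yfcc100m/select",
    params={
        "q": "autotags:dog",
        "fl": "photoid,downloadurl",
        "rows": 100,
        "wt": "json",
    },
    timeout=10,
)
for doc in resp.json()["response"]["docs"]:
    print(doc)
```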

2) Work directly with data from AWS S3

Apache MXNet, a deep learning framework you can run locally on your workstation, allows training with S3 data. Most training and inference modules in MXNet accept data iterators that can read data from and write data to a local drive as well as AWS S3.

The MMCOMMONS provides a data iterator for the YFCC100M images, stored as a RecordIO file, so you can process the images in the cloud without ever having to download them to your computer. If you are working with a subset that is sufficiently large, you can further filter it to generate a custom RecordIO file that suits your needs. Since the images stored in the RecordIO file are already resized and saved compactly, generating a RecordIO file from an existing one by filtering on-the-fly is more time- and space-efficient than downloading all images first and creating a RecordIO file from scratch. However, if you are using a relatively small subset, it is recommended to download just those images you need from S3 and then create a RecordIO file locally, as that will considerably speed up processing the data.
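As a minimal sketch of what this looks like in practice, the iterator below reads image batches straight from a RecordIO file on S3 using MXNet’s generic ImageRecordIter (not the MMCOMMONS-provided iterator). The S3 path is a placeholder, and this assumes an MXNet build with S3 support enabled and AWS credentials available in the environment.

```python
import mxnet as mx

# Stream training batches directly from a RecordIO file stored on S3.
# The S3 path is hypothetical; requires an MXNet build with S3 support.
train_iter = mx.io.ImageRecordIter(
    path_imgrec="s3://my-bucket/yfcc100m-subset.rec",
    data_shape=(3, 224, 224),   # channels, height, width
    batch_size=64,
    shuffle=True,
)

for batch in train_iter:
    images = batch.data[0]      # NDArray of shape (batch_size, 3, 224, 224)
    print(images.shape)
    break
```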

While one would generally set up Apache MXNet to run locally, note that the I/O latency of using S3 data can be greatly reduced by running on an Amazon EC2 instance in the same region as where the S3 data is stored (namely us-west-2, Oregon); see Figure 2. Instructions for setting up a deep learning environment on Amazon EC2 can be found here.

Figure 2. The diagram shows a cost-efficient setup with a Spot Instance in the same region (us-west-2) as the S3 buckets that houses YFCC100M and MMCOMMONS images/videos and RecordIO files. Data in the S3 buckets can be accessed in a same way from researcher’s computer; the only downside with this is the longer latency for retrieving data from S3. Note that there are several Yahoo! Webscope buckets (I3set1-I3setN) that hold a copy of the YFCC100M, but you only can access it using the path you were assigned after requesting the dataset.



3) Save cost by using Amazon EC2 Spot Instances

Cloud computing has become considerably cheaper in recent years. However, the price of using a GPU instance to process the YFCC100M and MMCOMMONS can still be quite high. For instance, Amazon EC2’s on-demand p2.xlarge instance (with an NVIDIA Tesla K80 GPU and 12GB of GPU memory) costs 0.9 USD per hour in the us-west-2 region. This would cost approximately $650 (€540) a month if used full-time.

One way to reduce the cost is to set up a persistent Spot Instance environment. If you request an EC2 Spot Instance, you can use the instance as long as its market price is below your maximum bidding price. If the market price goes above your maximum bid, the instance is terminated after a two-minute warning. To deal with such interruptions it is important to frequently store your intermediate results to persistent storage, such as AWS S3 or AWS EFS. The market price of an EC2 instance fluctuates (see Figure 3), so there is no guarantee as to how much you can save or how long you will have to wait for your final results. But if you are willing to experiment with pricing, in our case we were able to reduce the costs by 75% during the period January-April 2017.
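One simple way to make a training job robust against such interruptions is to push checkpoints to S3 at regular intervals and to fetch the most recent one when the instance comes back. The sketch below uses boto3 with a hypothetical bucket and key prefix; it is a pattern to adapt to your own training loop, not a complete solution.

```python
import boto3

s3 = boto3.client("s3")
CKPT_BUCKET = "my-experiment-bucket"   # hypothetical bucket for intermediate results
CKPT_PREFIX = "checkpoints/"

def save_checkpoint(local_path, epoch):
    """Upload a checkpoint after every epoch so little work is lost on termination."""
    s3.upload_file(local_path, CKPT_BUCKET, f"{CKPT_PREFIX}model-epoch{epoch:03d}.params")

def restore_latest_checkpoint(local_path):
    """Download the most recent checkpoint, or return None if there is none yet."""
    objs = s3.list_objects_v2(Bucket=CKPT_BUCKET, Prefix=CKPT_PREFIX).get("Contents", [])
    if not objs:
        return None
    latest = max(objs, key=lambda o: o["LastModified"])
    s3.download_file(CKPT_BUCKET, latest["Key"], local_path)
    return latest["Key"]
```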

Figure 3. You can check the current and past market price of different EC2 instance types from the Spot Instance Pricing History panel.



4) Apply for academic AWS credits

Consider applying for the AWS Cloud Credits for Research Program to receive AWS credits to run your research in the cloud. In fact, thanks to the grant we were able to release LocationNet, a pre-trained geolocation model that used all geotagged YFCC100M images.

Conclusion

YFCC100M is at the moment the largest multimedia dataset released to the public, but its sheer volume poses a high barrier to actually using it. To boost the usability and accessibility of the dataset, the MMCOMMONS community provides an additional AWS S3 repository with tools, features, and annotations that facilitate creating a feasible research and development environment for those with fewer resources at their disposal. In this column, we provided a guide on how a subset of the dataset can be created for specific scenarios, how the YFCC100M and MMCOMMONS data hosted on S3 can be used directly for training a model with Apache MXNet, and finally how Spot Instances and academic AWS credits can make running experiments cheaper or even free.

Join the Multimedia Commons Community

Please let us know if you’re interested in contributing to the MMCOMMONS. This is a collaborative effort among research groups at several institutions (see below). We welcome contributions of annotations, features, and tools around the YFCC100M dataset, and may potentially be able to host them on AWS. What are you working on?

See this page for information about how to help out.

Acknowledgements:

This dataset would not have been possible without the effort of many people, especially those at Yahoo, Lawrence Livermore National Laboratory, International Computer Science Institute, Amazon, ISTI-CNR, and ITI-CERTH.