Conference Papers

Mota P., Eskenazi M., Coheur L.
Natural Language Engineering
2018
Abstract:
Research on topic segmentation has recently focused on segmenting documents by taking advantage of documents covering the same topics. In order to properly evaluate such approaches, a dataset of related documents is needed. However, existing datasets are limited in the number of related documents per domain. In addition, most of the available datasets do not consider documents from different media sources (PowerPoints, videos, etc.), which pose specific challenges to segmentation. We fill this gap with the MUltimedia SEgmentation Dataset (MUSED), a collection of manually segmented documents from different media sources, covering seven domains, with an average of twenty related documents per domain. In this paper, we describe the process of building MUSED. A multi-annotator study is carried out to determine if it is possible to observe agreement among human judges and to characterize their disagreement patterns. In addition, we use MUSED to compare state-of-the-art topic segmentation techniques, including those that take advantage of related documents. Moreover, we study the impact of having documents from different media sources in the dataset. To the best of our knowledge, MUSED is the first dataset that allows a straightforward evaluation of both single- and multi-document topic segmentation techniques, as well as the study of how these behave in the presence of documents from different media sources. Results show that some techniques are indeed sensitive to different media sources, and also that current multi-document segmentation models do not outperform previous models, pointing to a research line that requires further attention.
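Multi-annotator segmentation studies like the one above are typically scored with window-based agreement metrics such as Pk (Beeferman et al.). The abstract does not state which metric MUSED uses, so the following is only an illustrative sketch of the standard Pk computation, not MUSED's actual evaluation code; segmentations are represented as one segment label per position.

```python
def pk(ref, hyp, k=None):
    """Pk: fraction of windows in which the hypothesis disagrees with the
    reference on whether positions i and i+k fall in the same segment."""
    if k is None:
        # usual convention: half the average reference segment length
        n_segments = ref[-1] + 1
        k = max(1, round(len(ref) / n_segments / 2))
    errors = 0
    total = 0
    for i in range(len(ref) - k):
        same_ref = ref[i] == ref[i + k]
        same_hyp = hyp[i] == hyp[i + k]
        if same_ref != same_hyp:
            errors += 1
        total += 1
    return errors / total

# reference: two segments of four positions each
ref = [0, 0, 0, 0, 1, 1, 1, 1]
print(pk(ref, ref))        # perfect agreement -> 0.0
print(pk(ref, [0] * 8))    # degenerate single-segment hypothesis
```

Lower values indicate better agreement; the degenerate hypothesis is penalized only in windows that straddle the reference boundary.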
Gupta V., Kim J., Pandya A., Lakshmanan K., Rajkumar R., Tovar E.
2011 8th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks, SECON 2011
2011
Abstract:
Wireless Sensor Networks (WSN) are being used for a number of applications involving infrastructure monitoring, building energy monitoring and industrial sensing. The difficulty of programming individual sensor nodes and the associated overhead have encouraged researchers to design macro-programming systems which can help program the network as a whole or as a combination of subnets. Most of the current macro-programming schemes do not support multiple users seamlessly deploying diverse applications on the same shared sensor network. As WSNs are becoming more common, it is important to provide such support, since it enables higher-level optimizations such as code reuse, energy savings, and traffic reduction. In this paper, we propose a macro-programming framework called Nano-CF, which, in addition to supporting in-network programming, allows multiple applications written by different programmers to be executed simultaneously on a sensor networking infrastructure. This framework enables the use of a common sensing infrastructure for a number of applications without the users being concerned about the applications already deployed on the network. The framework also supports timing constraints and resource reservations using the Nano-RK operating system. Nano-CF is efficient at improving WSN performance by (a) combining multiple user programs, (b) aggregating packets for data delivery, and (c) satisfying timing and energy specifications using Rate-Harmonized Scheduling. Using representative applications, we demonstrate that Nano-CF achieves 90% reduction in Source Lines-of-Code (SLoC) and 50% energy savings from aggregated data delivery.
Mota P., Coheur L., Curto S., Fialho P.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
2012
Abstract:
In this paper we target Natural Language Understanding in the context of Conversational Agents that answer questions about their topics of expertise, and have in their knowledge base question/answer pairs, limiting the understanding problem to the task of finding the question in the knowledge base that will trigger the most appropriate answer to a given (new) question. We implement such an agent and test different state-of-the-art techniques, covering several paradigms and moving from lab experiments to tests with real users. First, we test the implemented techniques in a corpus built by the agent’s developers, corresponding to the expected questions; then we test the same techniques in a corpus representing interactions between the agent and real users. Interestingly, results show that the best “lab” techniques are not necessarily the best for real scenarios, even if only in-domain questions are considered.
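The understanding task described above — mapping a new user question to the closest question in a question/answer knowledge base — can be sketched with a simple bag-of-words cosine-similarity baseline. This is just one of the many paradigms such a comparison could cover, and the knowledge-base entries below are hypothetical, not the paper's data.

```python
import math
from collections import Counter

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_match(kb, query):
    # kb: list of (question, answer) pairs;
    # return the answer attached to the most similar KB question
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(q.lower().split())), a) for q, a in kb]
    return max(scored)[1]

kb = [("what time do you open", "We open at 9am."),
      ("how much is a ticket", "Tickets cost 5 euros.")]
print(best_match(kb, "when do you open"))  # -> "We open at 9am."
```

Richer techniques (string edit distances, semantic similarity, learned rankers) plug into the same retrieve-the-trigger-question loop.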
Zejnilovic S., Gomes J., Sinopoli B.
European Signal Processing Conference
2014
Abstract:
In order to quickly curb infections or prevent the spreading of rumors, the source of diffusion first needs to be localized. We analyze the problem of source localization, based on infection times of a subset of nodes in incompletely observed tree networks, under a simple propagation model. Our scenario reflects the assumption that having access to all the nodes and full network topology is often not feasible. We evaluate the number of possible topologies that are consistent with the observed incomplete tree. We provide a sufficient condition for the selection of observed nodes, such that correct localization is possible, i.e. the network is observable. Finally, we formulate the source localization problem under these assumptions as a binary linear integer program. We then provide a small simulation example to illustrate the effect of the number of observed nodes on the problem complexity and on the number of possible solutions for the source.
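Under a simple propagation model of the kind referenced above, infection spreads one hop per time step, so a node v is infected at time t0 + d(s, v) for source s. A source is then consistent with the observations exactly when observed infection-time differences match distance differences. The following brute-force sketch illustrates that consistency check on a hypothetical small tree (it is not the paper's integer-program formulation, which additionally handles unobserved topology).

```python
from collections import deque

def bfs_dist(adj, src):
    # hop distances from src in an unweighted graph
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def candidate_sources(adj, observed):
    # observed: {node: infection time}; under unit-speed spreading,
    # t(v) = t0 + d(s, v), so time differences must equal distance differences
    nodes = list(observed)
    ref = nodes[0]
    cands = []
    for s in adj:
        d = bfs_dist(adj, s)
        if all(observed[v] - observed[ref] == d[v] - d[ref] for v in nodes):
            cands.append(s)
    return cands

# path 0-1-2-3-4; true source is node 1, infection starts at time 0
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(candidate_sources(adj, {0: 1, 3: 2}))  # times observed at nodes 0 and 3
```

With too few observed nodes the candidate set grows, which mirrors the abstract's point about the number of possible solutions for the source.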
Zejnilovic S., Gomes J., Sinopoli B.
2013 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013
2013
Abstract:
Identifying the patient-zero of an epidemic outbreak, locating the person who started a rumor in a social network, finding the computer that initiated the spreading of a computer virus in a network: these are all applications of localizing the source of diffusion in a network. Since most of the networks of interest are very large, we are usually able to observe only a part of the network. In this paper, we first present a model for the dynamics of network diffusion similar to the state update of a linear time-varying system. Based on this model, we provide a sufficient condition for observability of the network, i.e., we establish when the partial information available to us is sufficient to uniquely localize the source. Also, we connect the problem of finding the smallest subset of observed nodes to the problem of finding a metric basis of the graph. We then present different methods to perform source localization depending on network observability.
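A metric basis, mentioned in the abstract above, is a smallest set R of landmark nodes such that every node in the graph has a distinct vector of distances to R. For small graphs it can be found by exhaustive search; the sketch below is a generic illustration of the concept, not the paper's observer-selection method.

```python
from collections import deque
from itertools import combinations

def distances(adj, src):
    # hop distances from src in an unweighted graph
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def metric_basis(adj):
    # smallest R whose distance vectors distinguish all nodes
    nodes = sorted(adj)
    dist = {u: distances(adj, u) for u in nodes}
    for k in range(1, len(nodes) + 1):
        for R in combinations(nodes, k):
            vecs = {tuple(dist[r][u] for r in R) for u in nodes}
            if len(vecs) == len(nodes):
                return list(R)

# 4-cycle: one landmark cannot separate its two neighbors,
# so the metric dimension is 2
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
print(metric_basis(cycle))
```

For a path graph a single endpoint landmark suffices, which is why trees and paths are the easy cases for observer placement.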
Zejnilovic S., Mitsche D., Gomes J., Sinopoli B.
2014 IEEE Global Conference on Signal and Information Processing, GlobalSIP 2014
2014
Abstract:
Localizing a source of diffusion is a crucial task in various applications such as epidemics quarantine and identification of trendsetters in social networks. We analyze the problem of selecting the minimum number of observed nodes that would lead to unambiguous source localization, i.e. achieve network observability, when both infection times of all the nodes, as well as the network structure cannot be fully observed. Under a simple propagation scenario, we model the assumption that, while the structure of local communities is well known, the connections between different communities are often unobserved. We present a necessary and sufficient condition for the minimum number of observed nodes in networks where all components have either a tree, a grid, a cycle or a complete graph structure. Additionally, we provide a sufficient condition for the selection of observed nodes when the components are of arbitrary structure. Through simulation, we illustrate the performance of the proposed bound.
Gupta V., Pereira N., Gaur S., Tovar E., Rajkumar R.
RTCSA 2014 - 20th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications
2014
Abstract:
Support for multiple concurrent applications is an important enabler for promoting the use of sensor networks as an infrastructure technology, where multiple users can deploy their applications independently. In such a scenario, different applications on a node may transmit packets at distinct periods, causing the node to change from sleep to active state more often, which negatively impacts the energy consumption of the whole network. In this paper, we propose to batch the transmissions together by defining a harmonizing period to align the transmissions from multiple applications at periodic boundaries. This harmonizing period is then leveraged to design a protocol that coordinates the transmissions across nodes and provides real-time guarantees in a multi-hop network. This protocol, which we call Network-Harmonized Scheduling (NHS), takes advantage of the periodicity introduced to assign offsets to nodes at different hop-levels such that collisions are always avoided, and deterministic behavior is enforced. NHS is a light-weight and distributed protocol that does not require any global state-keeping mechanism. We implemented NHS on the Contiki operating system and show how it can achieve a duty-cycle comparable to an ideal TDMA approach.
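The energy benefit of the harmonizing period can be seen numerically: deferring every transmission to the next common boundary collapses many wakeup instants into one. The sketch below is a simplified model of that batching idea only (harmonizing period taken as the smallest application period, hypothetical periods in ms); it does not model NHS's hop-level offsets or collision avoidance.

```python
def wakeups(periods, horizon):
    # one radio wakeup per distinct transmission instant, no batching
    times = {t for p in periods for t in range(0, horizon, p)}
    return len(times)

def harmonized_wakeups(periods, horizon):
    # defer each transmission to the next multiple of the harmonizing
    # period T_H (here: the smallest application period), so transmissions
    # from all applications batch at common boundaries
    th = min(periods)
    times = set()
    for p in periods:
        for t in range(0, horizon, p):
            times.add(((t + th - 1) // th) * th)  # round up to a boundary
    return len(times)

periods = [40, 70, 100]   # hypothetical per-application periods (ms)
horizon = 1400            # one hyperperiod
print(wakeups(periods, horizon), harmonized_wakeups(periods, horizon))
```

Fewer distinct wakeup instants means longer contiguous sleep intervals, which is where the energy savings reported in the abstract come from.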
Martins A.F.T., Figueiredo M.A.T., Aguiar P.M.Q., Smith N.A., Xing E.P.
Proceedings of the 25th International Conference on Machine Learning
2008
Abstract:
Positive definite kernels on probability measures have been recently applied in structured data classification problems. Some of these kernels are related to classic information theoretic quantities, such as mutual information and the Jensen-Shannon divergence. Meanwhile, driven by recent advances in Tsallis statistics, nonextensive generalizations of Shannon’s information theory have been proposed. This paper bridges these two trends. We introduce the Jensen-Tsallis q-difference, a generalization of the Jensen-Shannon divergence. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, Jensen-Shannon, and linear kernels as particular cases. We illustrate the performance of these kernels on text categorization tasks.
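The Jensen-Tsallis q-difference introduced above can be computed from the standard Tsallis entropy, S_q(p) = (1 - Σ p_i^q)/(q - 1), which recovers Shannon entropy as q → 1. The following numeric sketch uses a common equal-weight formulation that reduces to the Jensen-Shannon divergence at q = 1; the paper's weighted kernels may differ in detail.

```python
import math

def tsallis_entropy(p, q):
    # S_q(p) = (1 - sum_i p_i^q) / (q - 1); Shannon entropy in the q -> 1 limit
    if q == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)

def jensen_tsallis_q_difference(p1, p2, q, w=(0.5, 0.5)):
    # T_q(p1, p2) = S_q(w1*p1 + w2*p2) - (w1*S_q(p1) + w2*S_q(p2))
    m = [w[0] * a + w[1] * b for a, b in zip(p1, p2)]
    return tsallis_entropy(m, q) - (w[0] * tsallis_entropy(p1, q)
                                    + w[1] * tsallis_entropy(p2, q))

p1, p2 = [1.0, 0.0], [0.0, 1.0]
print(jensen_tsallis_q_difference(p1, p2, q=1))  # JS divergence = log 2
print(jensen_tsallis_q_difference(p1, p2, q=2))
```

For disjoint point masses the q = 1 value is log 2, the maximum of the Jensen-Shannon divergence in nats.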
Martins A.F.T., Smith N.A., Xing E.P., Aguiar P.M.Q., Figueiredo M.A.T.
Journal of Machine Learning Research
2009
Abstract:
Positive definite kernels on probability measures have been recently applied to classification problems involving text, images, and other types of structured data. Some of these kernels are related to classic information theoretic quantities, such as (Shannon’s) mutual information and the Jensen-Shannon (JS) divergence. Meanwhile, there have been recent advances in nonextensive generalizations of Shannon’s information theory. This paper bridges these two trends by introducing nonextensive information theoretic kernels on probability measures, based on new JS-type divergences. These new divergences result from extending the two building blocks of the classical JS divergence: convexity and Shannon’s entropy. The notion of convexity is extended to the wider concept of q-convexity, for which we prove a Jensen q-inequality. Based on this inequality, we introduce Jensen-Tsallis (JT) q-differences, a nonextensive generalization of the JS divergence, and define a k-th order JT q-difference between stochastic processes. We then define a new family of nonextensive mutual information kernels, which allow weights to be assigned to their arguments, and which includes the Boolean, JS, and linear kernels as particular cases. Nonextensive string kernels are also defined that generalize the p-spectrum kernel. We illustrate the performance of these kernels on text categorization tasks, in which documents are modeled both as bags of words and as sequences of characters.