Experience

Executive Director

February/2022 – Present

Developed and deployed production-level artificial intelligence infrastructure for building knowledge graphs from data

Description:

  • Designed the Human-Aware Data Acquisition (HADatAc) Framework, an open-source infrastructure for value-level curation, management, integration, and harmonization of scientific data. HADatAc captures and preserves rich connections between data and metadata (i.e., the data’s contextual knowledge) under a single knowledge graph.
  • The highly scalable knowledge graph in HADatAc is based on a polyglot data management solution: a NoSQL database manages data values; a graph database manages complex metadata; and an ontology-based web application semantically annotates and connects graph elements (a minimal sketch of this pattern follows this list).
  • Developing a knowledge graph in support of Mental Health Care systems in collaboration with the Universidade Federal de Minas Gerais, Brazil.
  • Developing the Madeira Knowledge Graph, intended to be the most comprehensive repository of Environmental Health knowledge related to Madeira Island, Portugal. This work is done with local collaborators on Madeira Island, including the University of Madeira and IPMA, and with external collaborators such as members of the OpenStreetMap and HADatAc.org communities. The goal is to leverage EU funding to support this project.
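
A minimal sketch of the polyglot pattern described above, under assumed names: the Solr update URL, the hasco-style namespace, and the hasValue property are placeholders for illustration, not the deployed HADatAc code. Values go to the NoSQL side, contextual metadata goes to the graph side, and both share the same URI so the knowledge graph stays navigable end to end.

```python
# Illustrative only: hypothetical endpoint and property names.
import json
import requests
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

SOLR_UPDATE_URL = "http://localhost:8983/solr/measurements/update"  # assumed endpoint
HASCO = Namespace("http://hadatac.org/ont/hasco/")                  # illustrative namespace
EX = Namespace("http://example.org/kg/")                            # illustrative namespace

def ingest(measurement_uri: str, value: float, study_uri: str) -> None:
    """Store the value in the NoSQL store and its context in the graph store."""
    # 1. Value-level record as a Lucene/Solr-style JSON document.
    doc = {"id": measurement_uri, "value_d": value, "study_uri_s": study_uri}
    requests.post(SOLR_UPDATE_URL, params={"commit": "true"},
                  data=json.dumps([doc]),
                  headers={"Content-Type": "application/json"})

    # 2. Contextual metadata as triples for the graph database.
    g = Graph()
    m = URIRef(measurement_uri)
    g.add((m, RDF.type, HASCO.Measurement))
    g.add((m, HASCO.isMemberOf, URIRef(study_uri)))
    g.add((m, EX.hasValue, Literal(value, datatype=XSD.double)))
    print(g.serialize(format="turtle"))  # in practice, pushed to the triple store
```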

Rensselaer Polytechnic Institute

Director Data Operations & Senior Research Scientist

January/2014 – January/2022

Developed and deployed production-level artificial intelligence infrastructure for building knowledge graphs from data

Description:

  • Personally bootstrapped the development of HADatAc’s code for six months and made the academic case for the infrastructure. Still using a hands-on approach, established two coordinated, collaborative development teams for HADatAc: an eight-member team at RPI’s Tetherless World Constellation group and a four-member team at the Yale Center for Ecosystems in Architecture (Yale-CEA).
  • The infrastructure is currently an essential component of high-impact, data-intensive projects: the data center for the National Institutes of Health’s Children’s Health Exposure Analysis Resource (CHEAR) Project, hosted at the Mount Sinai School of Medicine in New York City; the Gates Foundation’s Healthy Birth, Growth, and Development-Knowledge Integration (HBGD-KI) Program, hosted on Amazon AWS; and Yale-CEA’s partnership with United Nations Environment to develop the Ecological Living Module, also hosted on Amazon AWS.
  • Whenever possible, code has been deployed with the help of Docker. Web application code is written in a combination of programming languages, including Scala, Java, JavaScript, and Python. Data management code is based on Lucene queries for NoSQL operations and on SPARQL queries for graph operations. Predicates of graph operations come from World Wide Web Consortium (W3C) standard ontologies: XML Schema for primitive types, the Resource Description Framework (RDF) for graph elements, the Web Ontology Language (OWL) for ontology management, PROV for provenance, and the Simple Knowledge Organization System (SKOS) for knowledge organization. The use of standard vocabularies ensures better interoperability between HADatAc-based applications’ data and external data repositories (see the illustrative query sketch after this list).
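
As a concrete illustration of the graph-side operations described above, the sketch below issues a SPARQL query built from the listed W3C predicates. The endpoint URL and the result handling are assumptions for illustration; only the prefixes and predicates (rdf:type, skos:prefLabel, prov:wasGeneratedBy) come from the standards named in the bullet.

```python
# Illustrative only: assumed Fuseki-style endpoint, placeholder query.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://localhost:3030/hadatac/sparql"  # assumed endpoint

QUERY = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?dataset ?label ?activity WHERE {
  ?dataset  rdf:type            prov:Entity ;
            skos:prefLabel      ?label ;
            prov:wasGeneratedBy ?activity .
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"], row["label"]["value"], row["activity"]["value"])
```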

Pacific Northwest National Laboratory

Chief Scientist at Scientific Data Management Group (Level V)

2011 – 2013

Introduced the use of artificial intelligence (semantic) techniques to capture and leverage data provenance during the computation of image data analytics

Description:

  • The PNNL Chemical Imaging Initiative (CII) is advancing the way knowledge is extracted from large-scale, high-throughput observation, experiment, and stochastic simulation data. Within CII, established the CII Provenance project to systematically leverage data provenance to improve the automation of image analytics, including enhancements of analytic processes such as image reconstruction, image segmentation, and feature detection.
  • The project expanded PNNL’s data pipeline, originally designed to handle high-resolution mass spectrometry data averaging 40 GB per sample collection, to handle dynamic transmission electron microscope (DTEM) data averaging 1 TB per sample collection.
  • The CII Provenance project leveraged PNNL’s cloud-based Velo Data Management Facility interface to preserve and manage provenance knowledge. Image data analytics were based on PNNL’s image data pipeline.
  • The data infrastructure of Velo was developed on top of the Alfresco content management system, and the project introduced the use of triple stores to preserve image data provenance.
  • Data provenance has been essential for DTEM image analytics since it enables the identification of the material signatures needed to select reaction-front images from the millions of images generated during a single chemical reaction. Services were developed to manage logical connections between the Velo data repository and the provenance repository (see the sketch after this list).
  • PNNL’s matrix organizational structure was used to internally recruit and manage software engineers who helped develop user interfaces, visualization services, and visual analytic services. The development of these services required collaborations with experts from a broad range of domains, including biochemistry, material sciences, molecular biology, mass spectrometry, electron microscopy, and high-energy physics.
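
The sketch below illustrates, under assumed identifiers, the kind of logical connection the services above maintained between the Velo data repository and the provenance repository: a segmented image is linked back to its raw DTEM frame with W3C PROV terms. The VELO namespace, the path-style identifiers, and the helper function are hypothetical; only the PROV terms (prov:Entity, prov:Activity, prov:used, prov:wasGeneratedBy, prov:wasDerivedFrom) are standard.

```python
# Illustrative only: placeholder repository namespace and identifiers.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

PROV = Namespace("http://www.w3.org/ns/prov#")
VELO = Namespace("http://example.org/velo/")   # placeholder for repository URIs

def record_segmentation_provenance(raw_frame_id: str, segmented_id: str,
                                   activity_id: str) -> Graph:
    """Assert that a segmented image was derived from a raw DTEM frame."""
    g = Graph()
    raw, seg, act = VELO[raw_frame_id], VELO[segmented_id], VELO[activity_id]
    g.add((raw, RDF.type, PROV.Entity))
    g.add((seg, RDF.type, PROV.Entity))
    g.add((act, RDF.type, PROV.Activity))
    g.add((act, PROV.used, raw))
    g.add((seg, PROV.wasGeneratedBy, act))
    g.add((seg, PROV.wasDerivedFrom, raw))
    return g  # in practice, pushed to the provenance triple store

# Example: link a reaction-front candidate back to its source frame.
print(record_segmentation_provenance("dtem/frame_000123", "dtem/seg_000123",
                                     "runs/segmentation_42").serialize(format="turtle"))
```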

University of Texas at El Paso

Associate Professor of Computer Science, with tenure (2011 – 2012)
Assistant Professor (2006 – 2011)

Co-founded and led the Computer Science (CS) component of the National Science Foundation-funded CyberShare Center of Excellence during the center’s first five years. The CS component was the center’s team connecting CS to Environmental Sciences, Earth Sciences, and Computational Sciences.

Description:

  • Established the UTEP Trust Laboratory within the CyberShare Center and recruited 20+ graduate students and four staff members. The CS component introduced the use of semantic technologies, including RDF triple stores and SPARQL queries, to the students.
  • During the first two years of the center, it was more important to create a culture of properly using semantic technology within the group than to focus on software development. The required cultural change was accomplished through the development of many internal workshops and the introduction of semantic content into the regular Computer Science curriculum in courses like Databases, Software Engineering, and Information Integration.
  • The students successfully developed a new generation of intelligent data management and provenance capture tools used by scientists in the following projects and organizations: the National Science Foundation (NSF)-funded Geosciences Network (GEON) and EarthScope; the internationally funded Circumarctic Environmental Observatories Network (CEON); and major research agencies such as the U.S. National Aeronautics and Space Administration (NASA), the U.S. National Center for Atmospheric Research (NCAR), and Australia’s Commonwealth Scientific and Industrial Research Organisation (CSIRO).

Stanford University

Postdoctoral Fellow

2002 – 2005

Co-founded the Inference Web initiative, a foundation of the W3C PROV standard for provenance interoperability. Designed and implemented the Inference Web infrastructure, including the Proof Markup Language (PML) API and the PML explanation component. Inference Web was the explanation component of DARPA’s Personal Assistant that Learns (PAL) Project and the Intelligence Advanced Research Projects Activity’s (IARPA’s) Novel Intelligence for Massive Data (NIMD) Project. Apple’s Siri was a spin-off technology of the PAL Project (authored three publications with Siri’s founders), and IBM’s Watson was a spin-off technology of the NIMD Project (authored four publications with IBM Watson’s program manager and team lead).

COPASA-MG

Lead Data Manager (1994 – 1998)
Sr. Software Engineer (1992 – 1994)

Developed a simulation model used to predict the performance of a new system during the migration of more than 1.6 million lines of code, three months before the new system became operational. The predicted results identified performance issues that were fixed before the system migration. Planned and managed the first deployment of a large-scale Geographical Information System (twenty million customers) at a South American water utility company.