My research spans requirements engineering, privacy engineering, and natural language processing, focusing on the development of methods and tools for comprehending natural language requirements and formally representing the resulting knowledge.
Privacy policies mainly rely on abstract information types when describing data practices. Beyond abstraction, different stakeholders can use different words to describe the same domain concept. Abstraction and variability in concept representation are major causes of ambiguity in natural language and reduce the shared understanding between app developers and policy writers. To overcome this obstacle, I developed formal ontologies from privacy policies that help policy authors, app developers, and regulators consistently check how data practice descriptions relate to one another and identify unintended interpretations. Arranging terminology into a hierarchical organization (i.e., an ontology) captures relations among categories, such as subclass/superclass, part/whole, and synonymy. Constructing an empirically valid ontology is challenging because the ontology must be both scalable and consistent with the interpretations of multiple users. During my doctoral program, I constructed and extended privacy ontologies using heuristics, natural language patterns, and neural networks, which I now discuss.
I proposed a manual approach in which two analysts compare two information types extracted from privacy policies and assign a semantic relationship. The relationship assignment is based on seven heuristics discovered through content analysis and a grounded-theory study of five privacy policies. In one study, we extracted information types related to automatic data collection practices from 50 policies, yielding 351 information types. As an analyst, I assigned relationships to pairs constructed from the 351 information types; I then compared my results with a second analyst's, calculated inter-rater reliability, and reconciled differences to achieve high agreement on the assignments. Finally, I formalized the relationships between pairs in an ontology using Description Logic. We also extracted user-provided information from data collection practices in the privacy policies of 20 apps in the health, finance, and dating domains. We applied a similar approach to construct three separate ontologies for these domains. Further, we used these ontologies to detect misalignments between privacy policies and app code.
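The agreement step between the two analysts can be sketched as follows. Cohen's kappa is one common inter-rater reliability statistic; the statement does not specify which statistic was used, and the relation labels below are hypothetical examples, not data from the study.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' category assignments."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical relation labels for five information-type pairs
a = ["subclass", "synonym", "part_of", "subclass", "unrelated"]
b = ["subclass", "synonym", "subclass", "subclass", "unrelated"]
print(round(cohens_kappa(a, b), 2))  # 0.71
```

Kappa discounts agreement that would occur by chance, which matters when a few relation categories (e.g., "unrelated") dominate the pair set.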
As a motivating example, consider a policy snippet from app A: "We collect information, including device information and device ID." To construct an ontology, we first extract "device information" and "device ID." Using the heuristics, we infer a subclass relationship between these two information types, which is added to the ontology. We can then apply this ontology to check policy and code alignment in app B, which collects "device ID" and discloses the collection of "device information" in its policy. Using the ontology, we can acknowledge the company's disclosure of the information collection, yet raise a warning that the disclosed phrase is too general, which can potentially mislead the user or violate privacy regulations.
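This alignment check amounts to a transitive subsumption lookup over the subclass relations in the ontology. A minimal sketch follows; the dictionary contents and function names are illustrative, not the actual ontology or tool.

```python
# Hypothetical subclass map: each key is a subclass of its value.
SUBCLASS = {
    "device ID": "device information",   # inferred from app A's policy
    "device information": "information",
}

def is_subsumed(specific, general):
    """True if `specific` equals `general` or is a transitive subclass of it."""
    while specific is not None:
        if specific == general:
            return True
        specific = SUBCLASS.get(specific)
    return False

collected_in_code = "device ID"          # what app B's code collects
disclosed_in_policy = "device information"  # what app B's policy discloses

if not is_subsumed(collected_in_code, disclosed_in_policy):
    print("violation: collection not disclosed")
elif collected_in_code != disclosed_in_policy:
    print("warning: policy term is more abstract than collected data")
else:
    print("exact match")
```

Running this sketch prints the warning case: app B's disclosure covers the collection, but only via an abstract superclass.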
Our research group at UTSA implemented an open-source tool suite called PoliDroid that lets developers test their code in real time for potential misalignments and generate code-based privacy specifications using the ontology. My contributions to this project are threefold: (1) identifying data collection statements using part-of-speech tags and bag-of-words features; (2) identifying missing information types in data collection statements when compared with code; and (3) raising a warning if an abstract term for a missing information type is mentioned, and a violation otherwise.
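Step (1) can be sketched with a bag-of-words keyword match alone; the actual classifier also uses part-of-speech tags, and the verb list below is a hypothetical illustration, not PoliDroid's lexicon.

```python
# Hypothetical set of verbs that signal a data collection statement.
COLLECTION_VERBS = {"collect", "gather", "obtain", "receive", "store"}

def is_collection_statement(sentence):
    """Flag a policy sentence that contains a collection verb (bag-of-words)."""
    words = {w.strip(".,;").lower() for w in sentence.split()}
    return bool(words & COLLECTION_VERBS)

print(is_collection_statement("We collect your device ID and location."))  # True
print(is_collection_statement("You may contact us at any time."))          # False
```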
To improve on the previous method, I developed a semi-automated, syntax-based method that employs a shallow typology and 26 regular expressions to categorize the individual words in information types and parse them to infer semantic relations. We discovered these patterns through grounded analysis of information types extracted from 50 privacy policies, and we evaluated them using crowdsourcing. To formalize the patterns, I further introduced a context-free grammar (CFG) to parse information types. Under a generative treatment of morphology, the CFG produces variants of a given phrase, which are necessary to build an ontology and overcome the abstraction problem in natural text. Finally, I developed a tool that augments the CFG with semantic attachments to infer relations. For example, the method can derive "IP address" as a morphological variant of "device IP address" and infer a part-whole relationship between them. The semantic attachments are formally expressed in lambda calculus. This method fails to infer semantic relations between some information types, such as "mobile device" and "iPhone," because it lacks the tacit knowledge that is not expressed in phrase syntax.
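The variant-generation step can be sketched as follows, under the simplifying assumption that dropping leading modifiers of a noun phrase yields its variants; the real method uses the typology, regular expressions, and a CFG with semantic attachments rather than this bare string manipulation.

```python
def variants(phrase):
    """Suffix variants of a noun phrase, e.g.
    'device IP address' -> ['device IP address', 'IP address', 'address']."""
    words = phrase.split()
    return [" ".join(words[i:]) for i in range(len(words))]

def candidate_part_whole(phrase):
    """Pair each shorter variant with the next-longer one as a
    candidate (part, whole) relation."""
    v = variants(phrase)
    return [(part, whole) for whole, part in zip(v, v[1:])]

print(variants("device IP address"))
print(candidate_part_whole("device IP address"))
# -> [('IP address', 'device IP address'), ('address', 'IP address')]
```

Note that the second candidate, ("address", "IP address"), illustrates why purely syntactic variants need filtering: not every suffix of a phrase is a meaningful information type.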
Naturalness of Software: Researchers argue that natural language may seem complex and powerful, yet it is largely regular and predictable. Similarly, programming languages are highly expressive, but because programs are written by real people, code is mostly repetitive. Exploiting this property, statistical learning models can be applied to pattern learning and prediction problems over code. Extracting knowledge about which information types a mobile application collects is a major problem in code analysis. Previous research has proposed extracting this information by analyzing Android API calls within the application code. However, applications can also collect sensitive personal information through user input fields in their user interfaces, which does not require the Android API. To identify the information types collected through user input fields, I plan to apply statistical learning models to decompiled application code as one aspect of my future work.
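The repetitiveness that makes statistical models effective on code can be illustrated with a toy bigram model over code tokens; the two-line corpus and function names below are hypothetical, and a real model would be trained on a large collection of decompiled apps.

```python
from collections import defaultdict

# Hypothetical corpus: two tokenized Android snippets reading user input fields.
corpus = [
    "EditText email = findViewById ( R.id.email ) ;".split(),
    "EditText phone = findViewById ( R.id.phone ) ;".split(),
]

# Count bigram frequencies: counts[prev][cur] = occurrences of (prev, cur).
counts = defaultdict(lambda: defaultdict(int))
for tokens in corpus:
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1

def most_likely_next(token):
    """Most frequent token observed after `token` in the corpus."""
    following = counts[token]
    return max(following, key=following.get) if following else None

print(most_likely_next("="))  # 'findViewById' in both snippets
```

Even this tiny model captures that `=` is always followed by `findViewById` in the corpus, hinting at how regularities in decompiled code could expose which input fields feed data collection.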