Mitra Bokaei Hosseini
North Paseo Building - NPB 2.228, University of Texas at San Antonio

Research Statement

My research spans requirements engineering, privacy engineering, and natural language processing, focusing on the development of methods and tools for comprehending natural language requirements and formally representing this knowledge.

Several state laws, along with app markets such as Apple's App Store and Google Play, require app developers to provide users with legal privacy notices (privacy policies) containing critical requirements that inform users about what kinds of personal information are collected, how the data is used, and with whom the data is shared. Because privacy policies consist of legal terms often written by a legal team without rigorous insight into the app source code, and because the policy and app code can change independently, privacy policies can become misaligned with the actual data practices. In addition to misinforming users, such inconsistencies between policies and app code can have legal repercussions.

My research goal is to capture and formalize the semantics of natural language privacy policies in a knowledge base that can be applied to misalignment detection tools, thus enabling policy authors and app developers to align the privacy policy with the app code. Applying this knowledge base can promote (1) transparent software implementation for users; and (2) shared understanding among policy authors, app developers, and regulators.

My work faces three major challenges. First, identifying trace links between natural language privacy policies and app code cannot be achieved without addressing abstraction and variability in how privacy policies represent concepts, as well as the vague and unbounded information collected through input fields in app code. Second, designing and evaluating solutions for these problems requires datasets in the mobile app privacy policy domain. Third, my solutions should be generalizable and scalable, considering both the growing number of apps available in the market and the evolving privacy-related regulations.

Formal Representation of Privacy Policy Semantics

Privacy policies mainly focus on abstract information types when describing data practices. Apart from abstraction, various stakeholders can use different words to describe the same domain concept. Abstraction and variability in concept representation are the main causes of ambiguity in natural language and reduce the shared understanding among app developers and policy writers. To overcome this obstacle, I developed formal ontologies for privacy policies that can help policy authors, app developers, and regulators consistently check how data practice descriptions relate to one another and identify unintended interpretations. Arranging terminology into a hierarchical organization (i.e., an ontology) captures relations among categories, such as subclass/superclass, part/whole, and synonymy. Constructing an empirically valid ontology is a challenging task, since it should be both scalable and consistent with the interpretations of multiple users. During my doctoral program, I constructed and extended privacy ontologies using heuristics, natural language patterns, and neural networks, which I now discuss.

I proposed a manual approach in which two analysts compare two information types extracted from privacy policies and assign a semantic relationship to the pair. The relationship assignment is based on seven heuristics that were discovered through content analysis and grounded theory applied to five privacy policies. In one study, we extracted information types related to automatic data collection practices from 50 policies, yielding 351 information types. As an analyst, I assigned relationships to pairs constructed from these 351 information types. I compared my results with those of another analyst, calculating the inter-rater reliability and reconciling the differences to achieve high agreement on the assignments. Finally, I formalized the relationships between pairs in an ontology using Description Logic. We also extracted user-provided information from the data collection practices described in the privacy policies of 20 apps in the health, finance, and dating domains, and applied a similar approach to construct three separate ontologies for these domains. Further, we used these ontologies to detect misalignments between privacy policies and app code.
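The agreement step described above can be illustrated with a short sketch. It assumes Cohen's kappa as the inter-rater statistic, which the text does not specify, and uses made-up relationship labels; the function name and the sample data are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two analysts labeling the same list of pairs.

    Assumption: kappa is used here only as one common agreement
    statistic; the study may have used a different measure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both analysts must label the same non-empty list")
    n = len(labels_a)
    # Observed agreement: fraction of pairs given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each analyst's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both analysts used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical relationship assignments for four information-type pairs.
analyst_1 = ["subclass", "synonym", "part", "subclass"]
analyst_2 = ["subclass", "synonym", "subclass", "subclass"]
kappa = cohens_kappa(analyst_1, analyst_2)
```

Pairs on which the two label lists differ are the ones the analysts would then discuss and reconcile.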

As a motivating example, consider a policy snippet of app A: "We collect information, including device information and device ID." To construct an ontology, we first extract "device information" and "device ID." Using the heuristics, we infer a subclass relationship between these two information types, which is added to the ontology. We can then apply this ontology to check policy and code alignment in app B, which collects "device ID" and discloses the collection of "device information" in its policy. Using the ontology, we can acknowledge the company's disclosure of the information collection, yet raise a warning about the use of a phrase that is too general, which can potentially mislead the user or disregard privacy regulations.

Our research group at UTSA implemented an open-source tool suite called PoliDroid that lets developers test their code in real time for potential misalignments and generate code-based privacy specifications using the ontology. My contributions to this project are threefold: (1) identifying data collection statements using part-of-speech tags and bag-of-words; (2) identifying information types that are missing from data collection statements when compared with the code; and (3) raising a warning if an abstract term for the missing information type is mentioned, and a violation otherwise.
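The warning-versus-violation rule, applied to the "device ID" example, might look roughly like the following. This is a minimal sketch and not PoliDroid's actual implementation: the ontology fragment, the function names, and the result labels are all hypothetical.

```python
# Hypothetical ontology fragment: child -> parent (subclass) links.
SUBCLASS_OF = {
    "device id": "device information",
    "ip address": "device information",
    "device information": "information",
}


def ancestors(term):
    """Return every more abstract term reachable from `term`."""
    found = []
    while term in SUBCLASS_OF and SUBCLASS_OF[term] not in found:
        term = SUBCLASS_OF[term]
        found.append(term)
    return found


def check_alignment(collected, policy_terms):
    """Classify each information type collected by the code.

    "ok":        the exact type is disclosed in the policy
    "warning":   only a more abstract term is disclosed
    "violation": neither the type nor an abstraction is disclosed
    """
    policy = {t.lower() for t in policy_terms}
    findings = {}
    for info in collected:
        key = info.lower()
        if key in policy:
            findings[info] = "ok"
        elif any(a in policy for a in ancestors(key)):
            findings[info] = "warning"
        else:
            findings[info] = "violation"
    return findings


# App B collects "device ID" but its policy only mentions "device information".
result = check_alignment(["device ID", "location"], ["device information"])
```

Here `result` marks "device ID" as a warning, because only its abstraction is disclosed, and "location" as a violation, because nothing in the policy covers it.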

To improve on the previous method, I developed a semi-automated, syntax-based method that employs a shallow typology and 26 regular expressions to categorize the individual words in information types and parse them to infer semantic relations. We discovered these patterns through grounded analysis of information types extracted from 50 privacy policies, and we evaluated the discovered patterns using crowdsourcing. To formalize the patterns, I further introduced a context-free grammar (CFG) to parse information types. Under a generative treatment of morphology, the CFG produces variants of a given phrase, which are necessary to build an ontology and overcome the abstraction problem in natural language text. Finally, I developed a tool that augments the CFG with semantic attachments to infer relations. For example, the method can derive "IP address" as a morphological variant of "device IP address" and infer a part-whole relationship between them. The semantic attachments are formally expressed using lambda calculus. This method fails to infer semantic relations between some information types, such as "mobile device" and "iPhone," because it lacks embedded tacit knowledge.
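The idea of deriving variants and attaching a relation to each derivation step can be sketched as follows. This is a toy illustration under strong assumptions: the modifier lexicon, the single "modifier* head" rule, and the function names are hypothetical and stand in for the actual typology, regular expressions, and grammar.

```python
# Hypothetical toy lexicon of words that act as modifiers.
# These tags are illustrative, not the typology used in the study.
MODIFIERS = {"device", "mobile"}

# Semantic attachment for the single toy rule NP -> Modifier NP:
# stripping the modifier yields a variant in a part-whole relation
# with the longer phrase (as in the "device IP address" example).
attach = lambda variant, phrase: (variant, "part-whole", phrase)


def parse(phrase):
    """Split a 'modifier* head' phrase into (modifiers, head)."""
    tokens = phrase.split()
    i = 0
    # Keep at least one token for the head.
    while i < len(tokens) - 1 and tokens[i].lower() in MODIFIERS:
        i += 1
    return tokens[:i], " ".join(tokens[i:])


def variants_and_relations(phrase):
    """Derive shorter variants of `phrase` and the relations between them."""
    modifiers, head = parse(phrase)
    variants, relations = [], []
    current = head
    # Re-attach modifiers from the innermost outward, one rule application each.
    for modifier in reversed(modifiers):
        longer = f"{modifier} {current}"
        variants.append(current)
        relations.append(attach(current, longer))
        current = longer
    return variants, relations


variants, relations = variants_and_relations("device IP address")
```

A pair such as "mobile device" and "iPhone" produces no relation in this sketch either, since nothing in the surface syntax connects the two phrases.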

To infer relationships based on tacit knowledge, I developed a convolutional neural network method that classifies the semantic relation between a pair of information types from privacy policy requirements by considering their semantic similarities. We used this approach to extend an existing ontology, which was then applied in a misalignment detection framework, leading to: (a) identification of additional misalignments; and (b) reduction of the time and effort required to identify abstract information types and the concepts they represent.

Ongoing and Future Work

Privacy Policy Ontology: Currently, I am preparing two papers on ontology construction methods that use (1) context-free grammars and syntax analysis to infer variations of information types and semantic relations, and (2) convolutional neural networks to classify the relationship between two given information types. In the future, I foresee defining the ontology construction problem in privacy as a paraphrasing task, where the goal is to predict a paraphrase template for a given information type expressed as a noun compound.

Third-Party Analysis: App-developing companies often offset costs by sharing end-user data. For instance, a free app may collect its users' personally identifiable information for marketing purposes, thereby amortizing the cost of app development over a period of time with eventual profit. The General Data Protection Regulation (GDPR), in Articles 13.1(e) and 14.1(e), enforces transparency of data practices with respect to the recipients or categories of recipients of personal data. The California Consumer Privacy Act, 1798.110.c, also states that businesses that collect data shall disclose the categories of third parties. We envision using an annotated privacy policy dataset to train named entity extraction models that extract third-party names and categories. Further, the results can be used to validate the compliance of apps' data practices with their privacy policies.

Naturalness of Software: Researchers argue that although natural language may seem complex and powerful, it is largely regular and predictable. Similarly, programming languages are highly complex, but the fact that programs are written by programmers (real people) makes them mostly repetitive. Using this property, statistical learning models can be applied to pattern learning and prediction problems in the coding domain. Extracting knowledge about what information types are collected in mobile applications is one of the major problems in code analysis. Previous research has proposed extracting this information by analyzing Android API calls within the application code. However, applications can also collect sensitive personal information through input fields in user interfaces, which does not require the use of the Android API. To identify the information types being collected through user input fields, I plan to apply statistical learning models to decompiled application code as one aspect of my future work.

Privacy Policy Generation: In practice, an app's code and privacy policy often have orthogonal life cycles, since they are not developed and maintained by a central entity in an enterprise. Apps often go through a rapid, cyclic "develop, test, and deploy" process in today's Agile culture, while policies go through a much slower legal process. The resulting gap persists primarily because of the lack of techniques and tools that allow code-level changes to propagate rapidly to the corresponding privacy policy at the right level of semantic abstraction. In the future, I plan to address this challenge by proposing template-based privacy policy generation and maintenance, wherein app code changes yield continuous updates of the policy.