Big Data and Information Privacy


Silvana Faja
University of Central Missouri, Department of Computer Information Systems and Analytics, Warrensburg, MO, USA

Silvana Trimi
University of Nebraska-Lincoln, Department of Supply Chain Management and Analytics

There is no question that Big Data (BD) holds great potential for a wide range of benefits for individuals, businesses, government, healthcare, and society in general. With the advent of the digital age, BD has helped facilitate urban planning, national security, college admissions, hiring, and dating, to mention just a few examples. For businesses, BD offers opportunities to achieve competitive advantage through business intelligence, improved decision-making, and innovation for new products and tailored services.

Big Data has disrupted the traditional information value chain through numerous new sources of data (e.g., social networking sites), new ways of collecting data (e.g., smart sensors), and, more recently, far more advanced analytics such as artificial intelligence and deep learning. Consequently, BD has received a significant amount of attention from researchers and practitioners (Abbasi et al., 2016). However, the focus has been mostly on the technical aspects of BD, including data security, and not enough on people and their social and institutional environments. One area that is increasingly and urgently demanding attention is privacy. Data security and privacy are considered by experts to be among the top three issues related to data analytics, along with data quality and new enabling technologies such as machine learning and AI (Davis, 2018). While most organizations have focused on security, privacy is becoming a top concern, particularly because of the EU's new General Data Protection Regulation (GDPR) privacy law, which goes into effect on May 25, 2018. While the new rules come from Europe, they impact organizations globally.

The main issue with BD is that its goals and those of privacy are diametrically opposed: BD depends on the collection, storage, sharing, and analysis of data, while privacy aims to protect that data and thus to minimize most of these processes. Individuals, organizations, and governments are concerned about data privacy, yet they seek the benefits of BD analytics, which may rely on invasion of privacy. This phenomenon has been referred to as the 'privacy paradox'.


There have been several definitions and conceptualizations of information privacy. A well-known information privacy measure in IS research is the Concern for Information Privacy (CFIP) instrument (Smith et al., 1996). This measure consists of four dimensions of privacy concerns: collection, secondary use, errors, and access. 'Collection' represents the concern that an excessive amount of data is being collected and stored by organizations. 'Secondary use' refers to the concern that data collected for one purpose is being used for another purpose; the measure further distinguishes between concerns about internal and external secondary use. 'Errors' refers to the concern about protection measures against accidental and deliberate errors in data handling. 'Access' captures the concern that data is readily available to people who are not properly authorized to access it.

These dimensions, identified more than two decades ago, still hold true and have become even more critical in the context of BD, the most worrisome being collection and secondary use. While concern about data collection relates most obviously to the volume aspect of BD, secondary use is what BD is all about: its value lies in identifying secondary uses of data. With the widespread use of networked and web-based technologies, other measures of privacy concerns were developed. Malhotra et al. (2004) introduced Internet Users' Information Privacy Concerns (IUIPC). This measure includes three dimensions: control, awareness, and collection. Control refers to the degree to which the individual has control over personal information. Awareness represents the degree to which a consumer is concerned about his or her awareness of organizational information privacy practices. Collection is the degree to which a person is concerned about the amount of individual-specific data that others possess relative to the benefits received. James et al. (2016) introduced the concept of Interpersonal Privacy Identity (IPI), which comprises beliefs about information release and interaction control, in other words, the right to control what information is released and to whom. More recently, Kayhan and Davis (2016) distinguished between situational privacy concern and dispositional privacy concern. Situational privacy concern refers to concern related to a specific online provider and varies depending on the context of the interaction. Dispositional privacy concern depends on an individual's personality characteristics and his or her overall concern about privacy. Using these measures, significant research has indicated that privacy concerns affect users' intention to use technology, making privacy an issue that must be addressed.

We are at the beginning of a new era of BD: that of artificial intelligence (AI). Analytics tools utilizing machine learning are precursors of tomorrow's super-intelligent systems, and ultimately of "general AI", machines that will perform the full range of human cognitive tasks. AI is contributing to a societal transformation with "3,000 times the impact" of the Industrial Revolution (The Economist, 2016). Breakthroughs are making a significant impact in three areas: data collection and aggregation (IoT, ubiquitous devices, and cloud networking); data processing (better business intelligence (BI) and algorithms, notably AI and its subfields, machine learning (ML) and deep learning (DL)); and computing power (better and faster processors, and cloud and edge computing). However, new technologies bring new challenges. BD technologies advance faster than the chain of systems that preserve information security and privacy, which includes legislation, policies, and processes (Lowry et al., 2017).


In a world where news of security breaches and data privacy violations makes frequent headlines, it is not surprising that people are concerned or have 'data anxiety' (Pink et al., 2018). This anxiety relates to how people experience uncertainty about data accuracy, ownership, and (mis)usage.

Organizations have always collected and analyzed customer data in an effort to better understand customers. What has changed is that the technology to collect, store, and analyze this data has advanced. While in principle individuals have a choice about whether or not to disclose their personal information, in practice, in the era of smart devices, e-commerce, and social platforms, there is not much choice. Much of users' data is collected without their awareness, not only as they use digital devices but also as they move through public and personal spaces (Barton et al., 2017). This tracking typically takes place without the informed consent of the user: information about browsing, games, payments, geo-locations, and more is collected without users' awareness and consent. To use "free" services like Facebook, Google, Instagram, or Snapchat, one "pays" with the identifiable information one must disclose and the large amount of data one creates while using them. Another crucial feature of modern big data analytics is the concept of collective privacy (Hermstruwer, 2017). Individual users of online services have the right to disclose as much personal information as they like. However, due to the ties among users in social networking sites, for example, people who willingly reveal information about themselves increase the chances of disclosing personal information about other users, regardless of those users' consent.

Furthermore, even though individuals may have consented to secondary use of their data, the ways this data could be used are not transparent and seem to grow exponentially. In fact, it is likely that an individual would not have reasonably consented to a specific new use of the data, or is simply not aware of it. BD tools and techniques seem to be shrouded in secrecy while they are used to collect all kinds of private data, something Richards and King (2013) refer to as the Transparency Paradox.

People's personal characteristics are of great interest to businesses seeking to target their promotional campaigns at the right market segments. When data about these characteristics cannot be collected directly from individuals, inferences can be made from online data, such as web browsing behavior. For example, such analyses could infer the probability that a person is female, introverted, a drug user, and so on. BD is enabling the creation of very detailed and rich profiles of individuals, and inferences made on the basis of BD analytics, often without the individual's knowledge, are of particular concern.
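The mechanics of such inference can be illustrated with a minimal sketch. Everything here is synthetic and hypothetical: the browsing profiles, the site categories, and the weights stand in for what a real profiler would learn from labeled training data; no real system or dataset is implied.

```python
import math

# Hypothetical browsing profiles: visit counts per site category (synthetic).
profiles = {
    "user_a": {"fashion": 12, "sports": 1, "tech": 3},
    "user_b": {"fashion": 2, "sports": 9, "tech": 14},
}

# Hypothetical per-category weights a profiler might learn from labeled
# data; a positive weight counts as evidence for the inferred attribute.
weights = {"fashion": 0.30, "sports": -0.20, "tech": -0.10}

def inferred_probability(visits, weights, bias=0.0):
    """Logistic score: turn weighted visit counts into a probability
    estimate for some sensitive attribute the profiler targets."""
    score = bias + sum(w * visits.get(cat, 0) for cat, w in weights.items())
    return 1.0 / (1.0 + math.exp(-score))

for user, visits in profiles.items():
    print(f"{user}: inferred probability = {inferred_probability(visits, weights):.2f}")
```

The point of the sketch is that the individual never supplies the sensitive attribute; it is estimated, with apparent statistical confidence, from behavior collected for other purposes.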

Another benefit of BD is that it helps businesses better understand their customers and provide new value through customized products and services. This can be supported by predictive analytics, one of the types of BD analytics. Predictive analytics can be further broken down into preferential predictions (such as the recommendation systems used by Amazon or Netflix) and preemptive predictions (Kerr and Earle, 2013). Preemptive predictions, however, when used to diminish a person's range of future options, can be unfair and discriminatory. Even though governments and corporations use BD to preempt activities in order to reduce risk, this is often done with little or no transparency or accountability. For example, loan companies use algorithms to decide who is at risk of default and to determine interest rates for clients with little or no credit history. Orbitz, the travel website, showed higher-priced deals to customers searching from Apple computers than to those searching from PCs, on the assumption that Apple owners are willing or able to pay more. In all these cases, individuals are unable to observe and respond to the information gathered or the assumptions made about them.

BD relies on a tightly interconnected infrastructure of technologies and organizations.

Businesses share or sell the data they own to their partners, vendors, or data brokers. Data brokers monetize data by aggregating information from multiple sources, which clearly raises concerns about protecting customer data. In an analysis of nine large data brokers, the Federal Trade Commission (FTC) found that the data broker industry is complex and multilayered, collecting data from both public and private sectors (Huerta and Jensen, 2017). In addition, the data brokers in question acquired data from other data brokers, creating a complex chain of custody. One of the brokers analyzed had compiled profiles containing 3,000 data points on every U.S. consumer. Organizations are also connected, in the context of data, through the outsourcing of data services. Due to its massive size, BD often cannot be stored on traditional corporate servers, causing organizations to outsource data storage and processing to third parties. This dependency significantly increases the potential for security and privacy leaks.

Given this connectivity, it has even been suggested that BD should be considered an industry, not a technology (Martin, 2015). The separate and distinct firms in the BD industry work through agreements to produce a product (BD) for customers. The privacy implication is that customer data protection depends not only on the privacy practices of the company that initially collected the data, but also on the data collection practices of the other participants in the BD industry (Martin, 2015). For example, when Facebook uses information from data brokers such as Acxiom and Datalogix, it must worry about their data collection methods as well.

Another aspect of BD risk management is data veracity, or quality. Poor quality may stem from the manner in which the data was collected, from data manipulation, or from users providing purposefully incorrect data. Incorrect outcomes of BD analysis can affect people's lives: denied credit, lost jobs, harsher or unfair judicial punishment, increased insurance costs, or actual financial loss. These consequences have a negative impact not just on individuals but on organizations and society at large.


Traditionally, privacy issues have been dealt with using two approaches: informed consent and anonymity. However, the applicability of these two approaches has become limited in the context of BD. 'Informed consent', which involves notice and choice, was not very effective to begin with, even before the BD era. Research has shown that users do not read lengthy privacy policies or end-user agreements, while efforts to simplify these disclosures lead to loss of fidelity and omission of the details necessary to explain information practices. This has been referred to as the "transparency paradox" (Barocas and Nissenbaum, 2014). Notice and choice have become even more problematic in the era of BD because it is difficult to foresee future uses of the data collected today: data can be shared, sold, and analyzed for secondary uses unrelated to the original purpose of collection, and organizations cannot realistically anticipate what the data will reveal until after intensive analysis is complete. In other words, even if a customer reads and understands the privacy policy, the policy may be incomplete.

A second approach to privacy, more technical in nature, is data anonymity, or de-identification: removing any personally identifiable information while retaining the research utility of the data. In the era of AI and ML, creating data ecosystems is of the utmost importance for scientific discoveries (such as preventing or curing diseases), smart cities, and a safe world. De-identification is an obvious response to privacy concerns related to BD. In fact, the proposition that BD does not include any personally identifiable data is one of the common arguments for dismissing privacy concerns related to BD. However, experts warn that this is a dated concept (Hutnik and Drye, 2017). When data from various sources are combined, new personal information is created, or data is re-identified. Technological advances support combining disparate pieces of data in ways that can identify a customer even when the individual pieces of data are not personally identifiable. For example, even if location-based data collected through mobile devices is not considered personally identifiable, when combined with other data it can create an identifiable profile of a user.
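This re-identification risk can be demonstrated with a toy linkage attack: a "de-identified" dataset, with names removed, is joined with a public dataset on shared quasi-identifiers. The records, field names, and datasets below are entirely synthetic and hypothetical.

```python
# A "de-identified" health dataset: names removed, but quasi-identifiers
# (zip code, birth date, sex) retained. All records are synthetic.
deidentified_health = [
    {"zip": "64093", "birth": "1975-03-02", "sex": "F", "diagnosis": "asthma"},
    {"zip": "68588", "birth": "1981-11-17", "sex": "M", "diagnosis": "diabetes"},
]

# A hypothetical public dataset (e.g., a voter roll) containing names
# alongside the same quasi-identifiers.
public_voter_roll = [
    {"name": "J. Smith", "zip": "64093", "birth": "1975-03-02", "sex": "F"},
    {"name": "R. Jones", "zip": "68588", "birth": "1981-11-17", "sex": "M"},
]

QUASI_IDENTIFIERS = ("zip", "birth", "sex")

def reidentify(anon_rows, public_rows, keys=QUASI_IDENTIFIERS):
    """Join the two datasets on quasi-identifiers; each match re-attaches
    a name to a supposedly anonymous record."""
    index = {tuple(r[k] for k in keys): r["name"] for r in public_rows}
    matches = []
    for row in anon_rows:
        name = index.get(tuple(row[k] for k in keys))
        if name is not None:
            matches.append((name, row["diagnosis"]))
    return matches

# Every record is re-identified even though no name was ever stored
# in the health dataset.
print(reidentify(deidentified_health, public_voter_roll))
```

The same join logic scales to real linkage attacks: the more fields two datasets share, the more likely each quasi-identifier combination is unique, and uniqueness is all re-identification requires.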

Other technical mechanisms to deal with BD privacy have been developed in recent years. They can be grouped based on the stages of the BD life cycle: data generation, data storage, and data processing (Jain et al., 2016). In the data generation phase, access restriction and data falsification techniques are used. In the data storage phase, privacy protections are based mainly on encryption procedures. In the data processing phase, Privacy-Preserving Data Publishing (PPDP) anonymization techniques, such as generalization and suppression, are utilized to protect the privacy of data.


Privacy regulations around the world vary from country to country. At one end of the spectrum, countries like China and Thailand lack many of the foundational regulations found in most other countries (Shey and Iannopollo, 2017). At the other end are the European countries, with a deep commitment to protecting individuals' right to data privacy. The EU's GDPR, which replaces the EU Data Protection Directive on May 25, 2018, is expected to give users more control over the ways their personal information is collected and used. The GDPR expands the scope of data protection so that it applies to anyone or any organization that collects and processes information related to EU citizens, no matter where they are based or where the data is stored (EUGDRP, 2018). The GDPR also changes the definition of personal data; for example, it now includes IP addresses and cookies. Under the new regulation, organizations that store and analyze large amounts of data will be required to have a data protection officer. Organizations need to keep records of all personal data, be able to prove that consent was given, and show where the data is going, what it is being used for, and how it is being protected. The regulation even contains a deliberately vague provision, as it is still technically hard to implement, giving consumers a "right to explanation" for automated decision-making by AI (Porup, 2018). Penalties for not complying with the regulation are substantially increased.

There has always been a gap in privacy regulation between the US and the EU, with important implications for US companies. Many US firms depend on access to and use of the personal information of EU citizens to provide data-driven services on the continent; cross-border information flows represent the fastest growing component of trade in both the EU and the US (Schwartz and Peifer, 2017). However, there have long been concerns on the EU side about insufficient privacy protection in the US. The European data protection system centers on the data subject as a bearer of rights and views data privacy as part of a legal culture of fundamental rights, while the US anchors its information privacy law in the marketplace (Schwartz and Peifer, 2017). To bridge the gap, the Safe Harbor Agreement was introduced in 2000. This agreement allowed US companies to handle the personal data of Europeans as long as they provided data protection for this subset of data similar to EU standards. However, in 2015 the agreement was voided in light of events, such as the Snowden leaks, that cast doubt on the adequacy of data protection in the US. A new agreement, the "EU-US Privacy Shield", took effect in August 2016. Privacy Shield requires 'opt-in' consent before processing sensitive data of EU customers. In addition, if US companies want to use data for purposes other than the original purpose of collection, they have to obtain new consent from data subjects. Privacy Shield also places restrictions on the transfer of such data to third parties.


Today businesses operate in the digital era. The old economic model bounded by national borders has given way to a new paradigm of the networked global economy, in which capitalism can flourish without capital (Haskel and Westlake, 2017). An accelerating, boundary-less ecosystem, including global supply chains and innovation alliances and powered by advanced technologies, has emerged. In this dynamic environment, organizations must be agile, flexible, resilient, absorptive, speedy, and innovative to survive and prosper. Thus, digitalization and disruptive smart innovations have become imperative for organizational transformation. Advances in automation, BD and analytics, AI, IoT, and cloud and edge computing have facilitated the creation of smart automobiles, smart homes, smart factories (Industry 4.0), smart organizations, smart cities, smart infrastructures, smart countries, and the aspiration of a smart future (Lee and Trimi, 2017).

The innovation system is no longer an isolated island of R&D within an organization. The new innovation paradigm advocates co-innovation (Lee et al., 2012), where new ideas come from many different sources: outside-in (customers, suppliers, partners, and even competitors), inside-out (alliances, licensing, etc.), collective intelligence (open innovation, open source, crowdsourcing, data ecosystems), and the convergence of technologies. BD is what drives this innovation system. The BD system is an organic, living system that can sense the pulse of the environment, analyze relevant data to extract useful information for strategic decisions, and evaluate the performance of operational systems. The BD system must utilize the most advanced available technologies to keep the system humming, including smart sensors, the Internet of Things (IoT), the Internet of Brains (IoB), machine learning, 3-D technologies, blockchain technologies, cloud computing, and the like. To harmonize these technologies in support of the BD system, a data-friendly artificial intelligence (AI) ecosystem must be developed for discoveries, innovations, on-demand delivery of appropriate information, and automation of management decision making.

The antagonistic dichotomy of open/shared data vs. closed/private data has become bigger and more important to resolve than ever. Privacy and security issues are "wicked problems" that involve the conflicting values of many stakeholders (Lowry et al., 2017) and lie at the intersection of individual, organizational, technological, legal, and ethical concerns. Opening government data spurs private-sector innovation; opening private-sector data enhances government capability to provide security and improve the welfare of citizens; collecting information from individuals and sharing it broadly helps new discoveries, innovations, and the betterment of life. Privacy, however, is a very important right of individuals. Regulations are still evolving and are globally diverse. There is no international cooperation on regulating data (collection, access, usage) and its processing (algorithms), either for sharing advances or for controlling bad uses of data and AI, such as AI weaponry. Governments should take care of the safety of their citizens while respecting citizens' privacy and not violating laws and democratic rules. Organizations should not only comply with the privacy regulations of multiple countries, but also meet and exceed customer expectations (Huerta and Jensen, 2017).

To conclude, big data, collected and shared properly and used by sophisticated AI, will create enormous opportunities, innovations, and insights that will solve a large number of problems with great scientific, economic, and social impact. Integration of siloed data across platforms, organizations, cities, and countries can bring enormous benefits to the welfare of society. Developing a smart "Deep BD" system supported by AI, capable of managing these wicked problems, should be the goal.


Abbasi, A., Sarker, S., & Chiang, R. (2016). Big Data research in information systems: Towards an inclusive research agenda, Journal of the Association for Information Systems, 17(2), Article 3.

Baesens, B., Bapna, R., Marsden, J., Vanthienen, J., & Zhao, J. (2016). Transformational issues of Big Data and analytics in networked business, MIS Quarterly, 40(4), pp. 807-818.

Barocas, S., & Nissenbaum, H. (2014). Big Data’s end run around procedural privacy protection, Communications of the ACM, 57(11), pp. 31-33.

Barton, D., Woetzel, J., Seong, J., & Tian, Q. (2017). Artificial intelligence: Implications for China, Report, McKinsey Global Institute, April 2017, Accessed February 4, 2018.

Davis, J. (2018). Unlocking the value: From data quality to artificial intelligence, InformationWeek, big-data/ai-machine-learning/unlock-the-value-from-data-quality-to-artificial-intelligence/a/d-id/1331076, Accessed March 1, 2018.

EUGDRP (2018). EU General Data Protection Regulation, https://, Accessed March 1, 2018.

Haskel, J., & Westlake, S. (2017). Capitalism without Capital: The Rise of the Intangible Economy, Princeton University Press.

Hermstruwer, J. (2017). Contracting around privacy: The (behavioral) law and economics of consent and Big Data, Journal of Intellectual Property, Information Technology & Electronic Commerce Law, 8, pp. 9-26.

Huerta, E., & Jensen, S. (2017). An accounting information systems perspective on data analytics and Big Data, Journal of Information Systems, 31(3), pp. 101-114.

Hutnik, A., & Drye K. (2017). A privacy roadmap for avoiding big risks with Big Data, Inside Counsel,, Accessed February 28 2018.

Jain, P., Gyanchandani, M., & Khare, N. (2016). Big Data privacy: A technological perspective and review, Journal of Big Data, 3(25).

James, T., Nottingham, Q., Collignon, S., Warkentin, M. & Ziegelmayer, J. (2016). The Interpersonal Privacy Identity (IPI): Development of a privacy as control model. Information Technology and Management, (17), pp. 341-360.

Kayhan, V., & Davis, C. (2016). Situational privacy concerns and antecedent factors, The Journal of Computer Information Systems, 56(3), pp. 228-237.

Kerr, I., & Earle, J. (2013). Prediction, preemption, presumption: How Big Data threatens big picture privacy, Stanford Law Review Online, 66, pp. 65- 72.

Lee, S. M., Olson D., & Trimi, S. (2012). Co-innovation: Convergenomic, collaboration, and co-creation for organizational values, Management Decision, 50(5), 817-831.

Lee, S. M. & Trimi, S. (2017). Innovation for a smart future, Journal of Innovation and Knowledge, doi:10.1016/j.jik.2016.11.001.

Lowry, P., Willison, R., & Dinev, T. (2017). Why security and privacy research lies at the centre of the Information Systems (IS) artefact: Proposing a bold research agenda, European Journal of Information Systems (EJIS), 27(6), pp. 546-563.

Martin, K. (2015). Ethical issues in the Big Data industry, MIS Quarterly Executive, 14(2), pp. 67-85.

Malhotra, N., Kim, S., & Agarwal, J. (2004). Internet users' information privacy concerns (IUIPC): The construct, the scale, and a causal model, Information Systems Research, 15(4), pp. 336-355.

Pink, S., Lanzeni, D., & Horst, H. (2018). Data anxieties: Finding trust in everyday digital mess, Big Data and Society, January-June, pp. 1-14.

Porup, J. M. (2018). What does the GDPR and the "right to explanation" mean for AI? CSO from IDG, February 9, 2018, Accessed March 2, 2018.

Richards, N., & King, J. (2013). Three paradoxes of Big Data, Stanford Law Review Online, 66, pp.41-46.

Shey, H., & Iannopollo, E. (2017). Compliance strategy is just the start of your privacy program, CIO,, Accessed March 8 2018.

Smith, H., Milberg, S., & Burke, S. (1996). Information privacy: measuring individuals’ concerns about organizational practices, MIS Quarterly, 20(2), pp.167-196.

Schwartz, P., & Peifer, K. (2017). Transatlantic data privacy law, The Georgetown Law Journal, 106, pp. 115-179.

The Economist. (2016). The return of the machinery question, Special Report: Artificial Intelligence, June 25th 2016.

Silvana Faja is a Professor of Computer Information Systems at the University of Central Missouri. She received her Ph.D. in Management Information Systems from the University of Nebraska-Lincoln. Her research areas include issues of privacy and security, electronic commerce, big data and analytics, e-health, agile development, and IS education topics such as online education, collaborative learning, and adoption of agile development in the classroom. She has published in journals such as Journal of Software Maintenance and Evolution, Communications of the Association for Information Systems, International Journal of Electronic Business, Information Systems Education Journal, Service Business: An International Journal, and Journal of Information Science and Technology. She has served as a reviewer for several journals and conferences and as a technical reviewer for textbooks.
Silvana Trimi is an Associate Professor in the Department of Supply Chain Management and Analytics at the University of Nebraska-Lincoln. Her research interests are Big Data, Artificial Intelligence and Machine Learning, Green IT and Supply Chain Management, Social Networking, Organizational and IT Innovation, Digital Convergence, and Knowledge Management. She has published more than 60 articles in such journals as Communications of the ACM, International Journal of Production Research, Journal of World Business, Communications of the AIS, Information and Management, Journal of Computer Information Systems, Industrial Management and Data Systems, International Journal of Public Administration, Journal of Innovation and Knowledge, International Journal of Knowledge Management, Management Decision, and others.
Updated: April 18, 2018 — 4:12 pm
