Data sources have mostly turned digital


Date: 27.10.2017

Analog processes

e.g., photography, films

Paper-based interactions

e.g., banking, e-administration

Communications

e.g., email, SMS, MMS, Skype

Where is your personal data? … In data centers

112 new emails per day → Mail servers
65 SMS sent per day → Telcos
800 pages of social data → Social networks
Web searches, list of purchases → Google, Amazon

Is this good news?

$2 billion a year is spent by US companies on third-party data about individuals (Forrester Report)
$44.25 is the estimated return on $1 invested in email marketing (oil: up to $0.50/yr)
High Market Value Companies

Facebook: value / #accounts = $50
Google: $38 billion business that sells ads based on how people search the Web
Amazon (knows purchase intent), mail order systems companies (Gmail), loyalty programs (supermarkets), banks & insurance, employment market (LinkedIn, Viadeo), travel & transportation (voyages-sncf), the « love » market (Meetic), etc.

All these data analytics are run on « centralised » systems (e.g. data centers)
Intrinsic problem #1: personal data is exposed to sophisticated attacks

High benefits to successful hack (or leak…)
One person's negligence may affect millions

Intrinsic problem #2: personal data is hostage to sudden privacy policy changes

Centralised administration of data means delegation of control
This leads to regular changes, driven by application (and business) evolution, mergers and acquisitions, etc. (e.g., Facebook 2012)

Increasing security is only a partial solution, since it does not solve those intrinsic limitations

E.g., TrustedDB [BS12] proposes tamper-resistant hardware to secure outsourced centralized databases.

A Personal Data Ecosystem…

… built around user-centricity and trust,
achieved through a decentralized architecture
with the same computing expressivity

1. Users store their own data
→ minimize abusive usage
2. Auto-administered platform
→ no DBA attack (even by the user)
3. Enforce privacy principles for externalized (shared) data
→ best if the recipient of the data is another TC
4. Tamper-resistance + certified code/secure execution + single user + physical access needed
→ the cost/benefit ratio of an attack is very high

Token Characteristics :

High security:

High ratio Cost/Benefit of an attack;
Secure against its owner;

Modest computing resources (~10Kb of RAM, 50MHz CPU);
Low availability: physically controlled by its owner; connects and disconnects at will

PROBLEM :

How to perform global queries on the asymmetric architecture? (i.e. using data from many/all cells)

Several approaches are possible to securely perform global computations:

Use only an untrusted server/cloud/P2P and use generic (and costly) algorithms. (e.g. Secure Multi-Party Computing [Yao82, GMW87, CKL06], fully homomorphic encryption [Gent09]) Problem = COST
Use only an untrusted server/cloud/P2P and develop a specific algorithm for each specific class of queries or applications. (e.g. DataMining Toolkit [CKV+02]) Problem = GENERICITY
Introduce a tangible element of trust, through the use of a trusted component and develop a generic methodology to execute any centralized algorithm in this context. ([Katz07, GIS+10, AAB+10])  Problem = TRUST

Querier:

Shares the secret key with TDSs (to encrypt the query and decrypt the result).
Classical Access control policy (e.g. RBAC):

Cannot get the raw data stored in TDSs (get only the final result)
Can obtain only authorized views of the dataset (inference attacks are not considered)

Supporting Server Infrastructure:
Does not know the query (and thus the attributes in the GROUP BY clause) because the query is encrypted by the Querier before being sent to the SSI.
Has prior knowledge about data distribution.
Honest-but-curious attacker: Frequency-based attack

SSI matches the plaintext and ciphertext of the same frequency.
e.g. investigates remarkable (very high/low) frequencies in dataset distribution
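
The frequency-based attack described above can be illustrated with a small sketch. It assumes the grouping attribute is encrypted deterministically (equal plaintexts give equal ciphertexts); `det_enc`, `frequency_attack`, and the sample values are hypothetical names chosen for illustration, and the hash-based `det_enc` is only a toy stand-in for a real deterministic cipher.

```python
import hashlib
from collections import Counter

def det_enc(value: str, key: bytes = b"demo-key") -> str:
    """Toy stand-in for deterministic encryption: same input, same output."""
    return hashlib.sha256(key + value.encode()).hexdigest()[:12]

def frequency_attack(ciphertexts, known_distribution):
    """Pair ciphertexts and plaintexts that share the same frequency rank."""
    observed = [c for c, _ in Counter(ciphertexts).most_common()]
    expected = sorted(known_distribution, key=known_distribution.get, reverse=True)
    return dict(zip(observed, expected))

# What the TDSs hold (never shown to the SSI in clear).
plain = ["flu"] * 6 + ["asthma"] * 3 + ["rare-disease"] * 1
# What the SSI sees.
cipher = [det_enc(v) for v in plain]
# Prior knowledge of the SSI about the data distribution (assumed values).
prior = {"flu": 0.6, "asthma": 0.3, "rare-disease": 0.1}

guess = frequency_attack(cipher, prior)
recovered = [guess[c] for c in cipher]
```

In this toy run the SSI recovers every value without ever seeing the key, which is why the remarkable (very high or very low) frequencies are the ones to hide.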

The main difficulty is with AGGREGATE QUERIES !!

Solutions vary depending on which kind of encryption is used, how the SSI constructs the partitions, and what information is revealed to the SSI.
Secure aggregation solution (presented briefly here)
Noise-based solutions (see paper)

random (white) noise
noise controlled by the complementary domain

Histogram-based solutions (see paper)
We investigate these solutions along the directions of performance and security.
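
A minimal sketch of the random (white) noise idea, under assumptions that are not stated on the slide: the grouping attribute is tagged deterministically so the SSI can group tuples, each TDS adds `nf` fake tuples whose group is drawn uniformly from the domain, and a flag hidden in the encrypted payload lets a TDS discard fakes later. The names `det_tag`, `tds_send`, `ssi_view`, and `tds_count` are hypothetical, and a plain dict stands in for the encrypted payload.

```python
import hashlib
import random
from collections import Counter

def det_tag(group: str, key: bytes = b"demo-key") -> str:
    """Toy deterministic tag on the grouping attribute."""
    return hashlib.sha256(key + group.encode()).hexdigest()[:12]

def tds_send(group, domain, nf, rng):
    """One real tuple plus nf fake tuples with uniformly random groups."""
    msgs = [(det_tag(group), {"fake": False})]
    msgs += [(det_tag(rng.choice(domain)), {"fake": True}) for _ in range(nf)]
    return msgs

def ssi_view(messages):
    """The SSI only sees tags, so fakes and real tuples look alike."""
    return Counter(tag for tag, _ in messages)

def tds_count(messages):
    """Inside a TDS the payload is readable, so fakes are dropped."""
    return Counter(tag for tag, payload in messages if not payload["fake"])

domain = ["flu", "asthma", "rare-disease"]
real_groups = ["flu"] * 6 + ["asthma"] * 3 + ["rare-disease"]
rng = random.Random(0)  # seeded only to make the example repeatable
nf = 2
messages = [m for g in real_groups for m in tds_send(g, domain, nf, rng)]
observed = ssi_view(messages)
true_counts = tds_count(messages)
```

The larger `nf` is, the closer the distribution seen by the SSI gets to uniform, at the price of `nf + 1` times more tuples to transfer and process.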

Secure Aggregation Efficiency problem :

nDet_Enc on AG → SSI cannot gather tuples belonging to the same group into the same partition.
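
Because the SSI cannot group by AG, each partition it builds mixes tuples from many groups, so several rounds of partial aggregation are needed. The sketch below shows one way such rounds could work, as an assumption rather than the authors' exact protocol: plain tuples stand in for nDet_Enc ciphertexts, and `partial_aggregate`, `secure_aggregation`, `partition_size`, and `reduction_factor` are hypothetical names.

```python
from collections import defaultdict

def partial_aggregate(partition):
    """Work done inside one TDS: merge (group, count, sum) triples per group."""
    acc = defaultdict(lambda: [0, 0.0])
    for group, count, total in partition:
        acc[group][0] += count
        acc[group][1] += total
    return [(g, c, s) for g, (c, s) in acc.items()]

def secure_aggregation(tuples, partition_size=2, reduction_factor=2):
    """The SSI forms arbitrary partitions; TDSs aggregate them round by round."""
    if partition_size < 1 or reduction_factor < 2:
        raise ValueError("partition_size >= 1 and reduction_factor >= 2 required")
    items = [(g, 1, v) for g, v in tuples]
    # Round 1: the SSI cuts the opaque tuples into partitions of fixed size.
    results = [partial_aggregate(items[i:i + partition_size])
               for i in range(0, len(items), partition_size)]
    rounds = 1
    # Later rounds: each TDS merges reduction_factor partial results.
    while len(results) > 1:
        merged = []
        for i in range(0, len(results), reduction_factor):
            batch = [t for r in results[i:i + reduction_factor] for t in r]
            merged.append(partial_aggregate(batch))
        results = merged
        rounds += 1
    final = results[0] if results else []
    return {g: (c, s) for g, c, s in final}, rounds

data = [("Paris", 10.0), ("Lyon", 5.0), ("Paris", 20.0), ("Lyon", 7.0),
        ("Nice", 1.0), ("Paris", 30.0), ("Nice", 3.0)]
totals, rounds = secure_aggregation(data)
```

With 7 tuples, partitions of 2, and a reduction factor of 2, the computation takes 3 rounds; a partial result can hold up to G entries, which is where the cost grows with the number of groups.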

Distribution of AG is discovered and distributed to all TDSs.

TDS allocates its tuple to corresponding bucket.
TDS sends to SSI: {h(bucketId), nDet_Enc(tuple)}
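
The bucket allocation steps above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `BUCKET_OF` is a made-up histogram, `KEY` a made-up shared secret, and `ndet_enc` is a toy randomized cipher (random nonce plus a SHA-256 keystream) that only demonstrates the non-deterministic property and must not be used for real data.

```python
import hashlib
import os
from collections import defaultdict

KEY = b"secret-shared-by-TDSs"                  # hypothetical shared key
BUCKET_OF = {"Paris": 0, "Lyon": 1, "Nice": 1}  # hypothetical histogram

def _keystream(nonce: bytes, length: int) -> bytes:
    out, counter = b"", 0
    while len(out) < length:
        out += hashlib.sha256(KEY + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def ndet_enc(plaintext: bytes) -> bytes:
    """Toy non-deterministic encryption: a fresh nonce on every call."""
    nonce = os.urandom(16)
    stream = _keystream(nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))

def ndet_dec(ciphertext: bytes) -> bytes:
    nonce, body = ciphertext[:16], ciphertext[16:]
    stream = _keystream(nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))

def h(bucket_id: int) -> str:
    """Keyed hash of the bucket identifier, the only thing the SSI can compare."""
    return hashlib.sha256(KEY + str(bucket_id).encode()).hexdigest()

def tds_message(group: str, value: int):
    """What a TDS sends: {h(bucketId), nDet_Enc(tuple)}."""
    return h(BUCKET_OF[group]), ndet_enc(f"{group}|{value}".encode())

def ssi_partition(messages):
    """The SSI groups ciphertexts by hashed bucket id without decrypting."""
    partitions = defaultdict(list)
    for tag, blob in messages:
        partitions[tag].append(blob)
    return dict(partitions)

msgs = [tds_message(g, v) for g, v in
        [("Paris", 10), ("Lyon", 5), ("Nice", 1), ("Paris", 20)]]
partitions = ssi_partition(msgs)
```

Here Lyon and Nice share a bucket, so the SSI learns only bucket-level frequencies, while a TDS that later decrypts a partition handles a small, known set of groups.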
Consequences :

Internal time consumption


Dataset size Ttuple: varies from 5 to 65 million

Number of groups G: varies from 1 to 10^6
Number of TDSs participating in the computation, as a percentage of all TDSs connected at a given time, Ttds: varies from 1% to 100%.
We fix two parameters and vary the third, measuring: execution time, parallelism of the protocol, total load, and maximum load on one TDS
When the parameters are fixed:
Ttuple = 10^6, G = 10^3, % of TDSs connected = 10% of Ttuple.
We also compute and use the optimal value for all reduction factors as well as for.
In the figures, we plot two curves for Rnf_Noise protocols RN (nf = 2) and WN (nf = 1000) to capture the impact of the ratio of fake tuples.

Experimental Scalability (experiments on LIPN cluster) TODS’16

Total Load


SELECT ..
FROM ..
WHERE ..
GROUP BY AG

G = card(AG)
Security: S_Agg > ED_Hist
Performance:

G > 10:
ED_Hist faster than S_Agg
G <= 10:
ED_Hist slower than S_Agg

Short/Mid-term research: Data-intensive Computing on an Asymmetric Architecture

SQL (With SMIS)

Queries here do not have joins !
Take into account more attack models (e.g. Broken Tokens)
Field experiment on usability (with ISN / A. Katsouraki PhD thesis)
Add usage control (A. Michel PhD thesis)

Private/Secure MapReduce (With LIPN; some results in CoopIS’15)

Investigate compatibility of our protocols.
Develop new protocols.
Check performance !

Secure Graph computations (With LIX)

Study social networking applications
Secure k-core and k-truss computations (Rossi PhD thesis)

XML management

Adapt the work on XQ2P (Butnaru, Gardarin, Nguyen) to the Trusted Cells context.
Distributed Window Queries.

Promoting the Trusted Cells vision

Trusted Cells “Core”

Open hardware and software bundle : basic functionalities

Local DB
Distributed DB
NoSQL DB

→ needed to develop PbD (privacy-by-design) personal data management applications!
Promote an open source community around Trusted Cells (UVSQ, INSA CVL, ENSIIE, INSA Lyon…)

Beyond Tamper Resistant HW

Results are usable even with lower-trust elements.
Include social trust / reputation.
Use virtualization.
