I'm a computer scientist studying creepy things we can do with your online data – AMA
Edit: Thanks everyone. Sorry for posting this too early - I appreciate your patience. I'm done for now, but I'll try to catch up with all the unanswered questions over the next day or so. -Jen
My short bio:
I'm a professor at the University of Maryland and Director of the Human-Computer Interaction Lab there. I've written a book, Analyzing the Social Web, on how to analyze social media, and my research focuses on social media, computing, and privacy. I've also written for Slate and the Atlantic.
Even if you try to keep it private, using computer models, we can find out all kinds of information about you from your Facebook/Twitter/other social media profile – sexual orientation, political leanings, personality traits, drug and alcohol habits, etc. The science behind this is fascinating, but it also raises really interesting questions about privacy and what control you should have over your data.
This is what I spend all my time working on. Want to know what we can find out about you, how it works, and what it means? AMA!
My Proof:
More info at my TED talk here: http://www.ted.com/talks/jennifer_golbeck_the_curly_fry_conundrum_why_social_media_likes_say_more_than_you_might_think
More about me at http://en.wikipedia.org/wiki/Jen_Golbeck
Twitter: http://twitter.com/jengolbeck
jengolbeck35 karma
This is an interesting question because it highlights how we are NOT using your data. While places like Amazon know what you have purchased, that doesn't always get incorporated into the algorithms to tell them to stop recommending things you've already bought. It's a place where more data could make it better, but there are a lot of concerns about whether information from all parts of the system should be integrated together. But you are right: without question, we have a long way to go in making better recommendations.
hiveNzin020 karma
1) What can you do with all the information ? Is it only for advertisements ?
I block them all so I will never know that you know.
2) Let's take my facebook account, everything is private, all the apps that can post anything public with my authorization are only for me. What can you find about me if I don't accept you in my friend list ?
jengolbeck11 karma
Ads are the place where there seems to be money in this now. However, I often (half) joke that if I get bored with this job, I would start a company that aggregates a lot of information about people, makes inferences over it (inferring things like commitment to your job, how well you work with others, how much of a procrastinator you are, etc.) and sell that report to businesses like your credit report gets sold. I think there is a lot of opportunity to make money off this data, but we are just starting to see this happen.
jengolbeck15 karma
I mentioned it in another response, but if you DO want to analyze yourself, you can use http://www.analyzewords.com/ on your Twitter account. It uses text analysis tools from LIWC at liwc.net to create a brief psychological profile. It's pretty cool.
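For readers curious what dictionary-based text analysis looks like in general, here is a minimal sketch. It assumes the basic idea is counting what share of someone's words fall into predefined categories; the tiny lexicon and the `category_rates` function below are made up for illustration and are not the LIWC dictionary or the AnalyzeWords implementation.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT the LIWC dictionary.
TOY_LEXICON = {
    "positive": {"happy", "love", "great", "awesome"},
    "negative": {"sad", "hate", "awful", "angry"},
    "self": {"i", "me", "my", "mine"},
}


def category_rates(text, lexicon=TOY_LEXICON):
    """Return the share of words in `text` that fall into each category."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for word in words:
        for category, vocab in lexicon.items():
            if word in vocab:
                counts[category] += 1
    total = len(words)
    # Avoid dividing by zero when the text has no words at all.
    return {c: (counts[c] / total if total else 0.0) for c in lexicon}


rates = category_rates("I love my dog")
# Two of four words are self-references, one is positive, none negative.
print(rates)
```

A real profile would compare rates like these against a large reference population, which is the part a toy lexicon cannot reproduce.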
jengolbeck29 karma
I have mixed feelings. As someone who has had a security clearance and who works with the government on a lot of projects, I think it was a very serious violation for him to share so much classified information. On the other hand, if the government (or part of it) is violating the laws designed to protect citizens, I understand his motivation for wanting to do something about it. He could not have achieved the same level of attention to these details in any other way.
But, I honestly try to focus my attention more on the science side than on the law/policy side. The latter is extremely important, but not where I have my deepest expertise.
ftapon11 karma
What do you say to people who say, "Privacy is overrated. If the government snoops on you and finds you doing something illegal, then isn't that a good thing for society? If you're not breaking the law, what do you have to worry about? You'll benefit from all the bad guys we catch."
jengolbeck10 karma
I think there are two arguments to make. On the government side, it is, in some ways, an easier discussion because there are lots of laws about how the government can collect information on you and use it. There are definitely issues to discuss there, but there is a guiding framework of what the government should be allowed to do.
On the non-government side it gets tricky. As MashCaster pointed out in response to your question, people get fired for things they post online. I am working on a book now on how to conduct investigations through social media, and I have heard from dozens of family lawyers who talk about how they use social media in custody and divorce cases. The fact is that even if you aren't doing something wrong, there can be ways that information about your legal activities can be used against you, whether it is presented honestly or twisted a little bit. I think it's naive to pretend that privacy doesn't matter; it does, especially when you are involved with people (like in legal matters) who do not want to give you the benefit of the doubt.
Talexe11 karma
Where do you draw the 'creepy line' when it comes to using online data, if at all? For example, do you think automatically scanning emails for information is acceptable - and will it continue to be as techniques grow ever-more sophisticated?
jengolbeck8 karma
I think each person's creepy line is different. I consider a lot of the stuff we can do - guessing who you will vote for, identifying your personality traits, etc - as kind of creepy because it can discover information you very explicitly try to keep private. Even more, the ability to compute that can come from things you don't expect. I tell the story in my TEDx talk linked above that liking the Facebook page for Curly Fries was shown to be one of the top predictors of high intelligence in a large study from Cambridge. That doesn't make a lot of sense, which means it can be very hard for an individual to prevent these algorithms from learning things about them.
almosthere03273 karma
I assume you guys are doing your due diligence with the statistics side of these inferences (i.e. correlation doesn't imply causality) so what real information can liking Curly Fries' facebook page tell you? Couldn't there be a bias (i'm smart so lots of facebook friends i have are also smart, my friend liked "curly fries" on facebook, insert hivemind/butterfly effect) that makes some of these predictors relatively useless? I feel like what that statistic really demonstrates is the interconnectedness of intelligent people on social media. In the military/fraternity/sales worlds people will tell you the number one reason most people join is because they were asked to, and when my friend "likes" something on facebook there's a good chance I'll "like" it too. Today I liked that my friend was listening to a Red Hot Chili Peppers song. Enough to see them in concert? Probably not. Enough to buy the song? Maybe. Enough to click the image of a thumbs-up button? Sure!
I guess what I'm saying is, without divulging any trade secrets can you give an example of how data becomes a reasonably certain (for whatever p- or f- or t-values you use) inference?
jengolbeck6 karma
couldn't there be a bias (i'm smart so lots of facebook friends i have are also smart, my friend liked "curly fries" on facebook, insert hivemind/butterfly effect)
You nailed it here. This isn't a bad thing - these kinds of patterns are what the algorithms are based on. This is a principle called homophily - you are friends with people like you. It is a huge part of why these algorithms can work.
Also, we are just looking at correlation, and that's ok. Liking curly fries correlates with high intelligence. We don't care why - the models just use that correlation to make a prediction.
But one point that follows from your comments is that this data is volatile. It could be that curly fries correlates with intelligence today, but it won't next month (because people unlike it and others like it). That means you need a lot of ground truth data (e.g. actual intelligence scores for people) to rebuild the models frequently.
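To make the "we just use the correlation" point concrete, here is a toy sketch. It assumes a small set of users with known scores (the ground truth) and predicts a new user's score from how much each of their likes shifts the average. The function names and numbers are invented for illustration; this is not the model from the Cambridge study, which worked with far more data and more sophisticated statistics.

```python
from statistics import mean


def like_lifts(users):
    """users: list of (set_of_likes, trait_score) pairs with known scores.

    Returns the overall mean score and, for each like, how far the mean
    score of the people who like it sits above or below that overall mean.
    """
    overall = mean(score for _, score in users)
    scores_by_like = {}
    for likes, score in users:
        for like in likes:
            scores_by_like.setdefault(like, []).append(score)
    lifts = {like: mean(s) - overall for like, s in scores_by_like.items()}
    return overall, lifts


def predict(likes, overall, lifts):
    """Predict a score as the overall mean plus the average lift of known likes."""
    known = [lifts[like] for like in likes if like in lifts]
    return overall + (mean(known) if known else 0.0)


# Made-up ground truth: (likes, score).
training = [
    ({"curly fries"}, 120.0),
    ({"curly fries", "page x"}, 110.0),
    ({"page x"}, 90.0),
    (set(), 80.0),
]
overall, lifts = like_lifts(training)
print(predict({"curly fries"}, overall, lifts))  # 115.0
```

Because the lifts are only as current as the training data, the volatility mentioned above means rerunning `like_lifts` on fresh ground truth whenever the population of likers shifts.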
qwerty_____9 karma
2 hours and no responses?
also - since you are from the University of Maryland, what are some interesting or otherwise overlooked courses you would recommend?
I am an intern learning information security right now and would be interested in your input.
Also, what colleges/universities would you recommend for information technology? ones that I should avoid?
jengolbeck2 karma
In addition to computer science, which has been my focus, I did an undergrad degree in economics. That has been so useful to me throughout all my work. If you can find classes in behavioral economics, I'd really recommend them.
As for universities, you can find good IT programs anywhere. I would recommend you consider what you want to get out of it when you're done - do you want to go to grad school? get a PhD? get a job doing security? work for a big company as internal IT support? Knowing a bit about that will help you find a department / university that caters to your interests.
qwerty_____1 karma
what relevance does behavioral economics have in security? (serious question)
I live in Maryland, which is why I figured I'd ask you. I'm not entirely sure what I would like to eventually do, I was just wondering if there are any colleges in Maryland that you have a high regard for.
jengolbeck2 karma
Behavioral economics helps you understand why people do what they do and how they make decisions.
Your security systems have people using them.
Thus, the more you understand about the human users, the more you can do to design systems that are usable and secure!
GershBinglander8 karma
Has your work made you more paranoid about your own privacy?
What things do you do to protect your own data?
Can the data be used for good and if so in what ways?
jengolbeck5 karma
I think I was about this paranoid before, but I'm a bit more informed about it now :)
I use a lot of Firefox plugins to block tracking cookies. DoNotTrackMe is a good one, but I probably have 6 installed. (Note - this sometimes means sites don't work, so I have a second browser running to hit the occasional site that won't function with all my blockers).
I also keep my social media pretty carefully limited. My facebook page only has my most recent 3 or 4 weeks' worth of activity. I deleted everything older than that, and go through around once a week and delete all the things more than 3 weeks old (all my likes, comments, posts, etc). That limits what can be inferred about me from my profile.
I wrote about that process (including some good tools) here: http://www.slate.com/articles/technology/future_tense/2014/01/facebook_cleansing_how_to_delete_all_of_your_account_activity.html?wpisrc=burger_bar
I don't think you need to go as far as I did, but cleaning out old stuff so your data footprint is smaller definitely limits what can be done with your profile data.
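The weekly cleanup rule boils down to a date cutoff. Here is a minimal sketch, assuming you have some exported list of your own activity where each item carries a timestamp. The record format and the `items_to_delete` function are hypothetical; nothing here calls Facebook or any real API, and the actual deletion still has to be done with the tools described in the article.

```python
from datetime import datetime, timedelta


def items_to_delete(items, now, keep_days=21):
    """Return the items posted more than `keep_days` days before `now`.

    `items` is a list of dicts with a `posted_at` datetime (hypothetical format).
    """
    cutoff = now - timedelta(days=keep_days)
    return [item for item in items if item["posted_at"] < cutoff]


activity = [
    {"kind": "like", "posted_at": datetime(2014, 4, 1)},
    {"kind": "comment", "posted_at": datetime(2014, 4, 25)},
]
stale = items_to_delete(activity, now=datetime(2014, 5, 1))
print([item["kind"] for item in stale])  # ['like']
```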
jengolbeck3 karma
Looks like that's a problem with the user scripts server. This should work: http://userscripts.org:8080/scripts/show/122073
And thanks for the note - I'll have them update the link in the article.
cmrivers6 karma
Hi, I coauthored a paper on using Twitter data responsibly for research (http://f1000research.com/articles/3-38/v1). Data like tweets are public, but they can be used in ways that violate privacy - like snowballing information across various sites. But given that it is all public, do these methods violate privacy? Do researchers have any responsibility to protect that privacy? Would love to hear your thoughts.
jengolbeck5 karma
I deal with these issues a lot as a researcher, as you know. My strategy has been to use the public data for research, but not to release the actual data from my experiments when I publish information about the algorithms I develop. People can replicate the experiments on other data; in fact, if they can't, it would show a weakness in my work.
But it's a hard question about whether this violates privacy. My personal thoughts on it are that using the tweets is fine. They really are public. However, once you do things with that data, you can end up with information that people never intended to share, and you can find that in ways that no human could understand. The actions that predict behaviors / traits often don't have any obvious meaningful connection. In that case, I think if you make the inferred information public, you are violating privacy. I think people should consent to how their information is used. If they make tweets public, they consent. But I don't think it's fair to assume an average user would understand how their actions lead to the inferences we make, so there really is no consent there.
packet_splatter6 karma
What happens if you do not have an account on one of the popular social media websites; what can you find out then?
jengolbeck3 karma
That's a lot harder. I tend only to work with public open data (for research ethics reasons). However, even without an account, companies who use persistent cookies, especially advertising companies, can track you across the web using your IP address. There has not been a ton of work on what inferences you can draw from that, since researchers typically don't have access to that information. However, I suspect it could still be informative for some things.
PvP_Noob4 karma
I work in the Big Data space monetizing consumer data. We are constantly working with our legal and privacy teams to do so in an anonymous aggregated fashion with no ability to re-identify anyone.
My bias may be professional but I am not sure I agree with your sentiment that corporations having access to this information is a bad thing. If we violate our consumers' trust, we not only fail to monetize our data assets, we drive off the very consumers who also pay our bills, and we lose, big time. We also recognize that there must be a value exchange between us and our consumers. I suspect in the long run most consumers will allow data tracking as they find the more targeted and relevant marketing to them to be "Worth it".
I would love to see the conversation move from data about you is bad, anyone who uses it is bad and shift it to a discussion on how we can collectively use information to drive value for both consumers and corporations and get to a place where everyone is better off.
jengolbeck4 karma
I'm not sure I ever said it was bad for companies to have this data. There are tradeoffs. You give up some privacy, but you get a lot of benefits, too. I really argue that people don't understand how their data is being used and they should be able to consent to how it is used. Some are fine with getting ads based on their email or searches. I often really like the ads I get with my google searches. Other people feel spied on or violated by this. I think there should be a larger discussion about what rights people have to their data and how it is used.
jengolbeck3 karma
Step 1. Collect extensive data from at least tens of thousands of users. More is better.
Step 2. ?
Step 3. Profit.
The ? in step 2 can be replaced by implementing some of the many algorithms people discuss in the literature, but the core reason these creepy inferences aren't used extensively is because the algorithms require LOTS of data to work well. Most people just don't have access to that. It's hard, time consuming, and expensive to get (unless, of course, you work at a company that collects it).
Colopty2 karma
What if my goal is not to profit, but rather to be able to walk up to someone and creep them the hell out? Y'know, just enough to haunt their nightmares a little.
jengolbeck4 karma
This guy did that and the result is pretty awesome: https://www.youtube.com/watch?v=5P_0s1TYpJU
jengolbeck6 karma
Yep. 9:50 this Saturday at Kettler Capitals Iceplex if you'd like to come watch!
mcymo3 karma
Oh cool, I have many questions:
- Do you only analyze data you can get from social networks or do you also analyze other sources like what you can get from e.g. browser fingerprint or google-analytics?
- If so what are the different conclusions you can derive from one single or the respective combination of these sources? How much better does the quality and sum of conclusions get with adding another source that complements existing sources? What is the lower threshold regarding useful information one can derive and what the upper threshold, meaning at what amount of sources (e-mail, likes, content of communication) does the result only improve marginally or insignificantly?
- Speaking of other sources, what could you do with the information the intelligence services have like the who, when and where of cellphone call metadata and/or the content of e-mails? Is that more or less powerful than social network data?
- I read services can derive just from the frequency of communications with a central node in the network if an attack is imminent, are you able to do these, too, and if so what else can you derive just from the frequency of communications? 
- Do you have a comprehensive list of what psychological/profiling properties you can derive from a source/sources of data?
- Is that area still changing, meaning are you able to derive more conclusions from the same amount of information and if so, how fast is it changing/improving? 
- Is the average profile a company/intelligence service can put together from the data they're retaining better than what an average psychiatrist is able to do? 
- If so, wouldn't it be nice for people to be able to get the profiles, because it could really help them with analyzing themselves and getting new insights. Also, the data is somewhat theirs, I believe. I know you can get your data-sheet from facebook if you demand, but you only get the raw-data, I mean the analysis. 
I know I had more questions, but that's it for now I hope you'd like to answer some of them and thx for this IAmA.
Edit:Grammar
jengolbeck2 karma
- You can analyze anything. I personally stick to social media in my research, but any data sources are likely to reveal things.
If you have ALL the data, as in your intelligence example, that is much more powerful. In general, the more data you have the more effective these tools are.
The second #1. Check out LIWC at http://liwc.net/. That has a great list of psychological traits they can analyze from your text. They also have a tool http://www.analyzewords.com/ that will profile you from your twitter profile. That's my favorite new toy these days.
But really, if you come up with a psychological test, you can try to predict someone's score on it from their data.
- That's a really good question. I don't think we are getting to the point where we can do more with less data. Access to more data is what we focus on because it makes things work better.
3/4. I'd say a psychiatrist can do SO much better than even the best data source because they have the benefits of context to the information they get and they have human abilities to understand human behavior in a way computers aren't even close to. That said, I think therapists would often like the added insights they can get from seeing all of a patient's social media posts.
jengolbeck3 karma
My favorite research on this topic is this article from Cambridge (which I discuss in my TEDx talk that I linked above):
http://www.pnas.org/content/early/2013/03/06/1218772110.full.pdf+html
It shows the huge number of personal traits that can be accurately predicted from someone's Facebook likes and the fact that the likes do not need to be obviously connected to the trait being predicted. For example, if someone likes the GOP page on Facebook, it's not hard to guess that they might be a Republican. But many inferences come from likes that are way less obvious or even nonsensical. It shows that even when you try to keep information about yourself private, things you do that seem to have no connection to it can reveal this information about you.
jengolbeck4 karma
Thanks!
1) I actually write most of my own code, so I'm honestly not familiar with these packages. I've used MALLET for some computational linguistics, and mostly have relied on LIWC (and MRC to a lesser extent) for psycholinguistics.
2) I suspect we are going to see the trend continue where people move toward using multiple services instead of consolidating all their activities into a single platform. We see this now with increasing popularity of chat apps, instagram, snapchat, etc, instead of everyone doing those things through Facebook.
3) I think the technology will get better, but I still foresee a big challenge for everyone outside the major tech companies having access to enough data to make them work well.
4) I think data users are going to push hard for this ability, and I think you will see some pushback from people who are described in the data. However, there isn't a lot that can give people control of their data if they don't consent to it being used in this way. So, if I had to predict, we will see more of this kind of use. It might be that some report comes out that causes a large enough public outcry to change the balance of data power, but we aren't there yet.
AdrianBlake3 karma
Is there a way I could use these methods to search about myself so I could edit the findable stuff?
jengolbeck2 karma
These methods tend to rely on easily accessible public data. A much more useful step would be to go through all your social media accounts, check your privacy settings, and increase them if necessary.
AdrianBlake2 karma
That's what I try to do. But aren't there companies that can view behind the privacy settings, buy your info from facebook? Or rather buy the info of 26 year old people from Bradford doing my job... so that they're fairly sure it's me?
jengolbeck3 karma
Ah yes. Some companies buy it and some get it other ways (through apps, partnerships, etc). In those cases you don't have a lot of control. You can not post the data in the first place, which is not a very helpful suggestion. You might also look at some of the data aggregators and see what they have about you. "Dragnet Nation", which I've talked about in a few other answers, talks about this in an interesting way.
entirely13 karma
Ask me anything, answer nothing.
She's scheduled this for 2pm eastern but put up the post way too early. I hope the thread survives. People are getting tetchy here.
motodriveby2 karma
I just stopped in with a mild interest and made sure she hadn't put in an edit before my snarky comment was made. On mobile there's no way to know she hadn't just disappeared...
Edit: People are starting to go a bit nuts, I myself am getting a little testy
entirely13 karma
Well she finally showed up. The crowd was starting to get surly, but I hope she doesn't get thrown to the lions.
jengolbeck6 karma
The responses to my error in posting too early have made it a less than ideal experience, but now I know better for next time. I thought the post would be held until my scheduled time. And it upsets people when it sits here without answers. Got it.
Also, entirely1, thanks for sticking up for me everywhere!
enigma_x2 karma
Hi. Thanks for doing this AMA. I'm a student of Computer Science and my area of research is Machine Intelligence -- so social media mining is a vital part of my area of interest. I've been following your research and what you've done and are doing is fascinating to me.
I have a hypothesis that our likes, preferences and online interactions do not just tell the researchers about our character traits and how we perceive things, but they dictate our future interactions as well. For instance, take the Facebook news feed. We tend to see more and more of activities of people whom we interact with and less of those we did not interact with. This means that we are not interacting with those people more because of our relationship but because we interacted with them in the past. The more you see of someone's activities the more you interact with them and thereby the algorithm forces you to strengthen the relationship with that person by showing more of that person's activities. This has had both positive and negative effects -- where people have actually formed closer relationships and ones where people don't want to see activities of these people anymore so they altogether remove them from their 'friend-list'.
- What do you think of this sort of extreme categorisation of relationships where you cannot choose to control the closeness of a person but the online social interaction does it for you. 
- Where is this heading towards? 
- Is this a focus of your research as well? If yes, what possible good can happen as a result of this? 
jengolbeck3 karma
You are right- these factors do (seem to) play into Facebook's algorithms for organizing your news feed. (the actual algorithm is private)
This is a bit outside the work I'm doing now, but I'd recommend you look at Jon Kleinberg's work. He looks at a lot of this, and he does really brilliant research http://www.cs.cornell.edu/home/kleinber/
dutis2 karma
Hi, thanks for doing an AMA. My questions would be: 1) Do you think that we are slowly transforming into a society where the word privacy will cease to exist and our grandchildren will look at it the same way we do at cassette players? 2) If an average person on the internet could do one thing to greatly increase their privacy, what would it be? Is it even worth trying with all the technology around?
jengolbeck3 karma
Nope, I don't think privacy is going to go away. As I mentioned to someone else in this AMA, there are situations, like legal cases, where no one is going to give you forgiveness or the benefit of the doubt for something stupid you put online in your teens. It will be used against you if it supports the other side. Privacy will always be valuable.
It is extremely difficult to get your information off the web altogether (I'll plug the book "Dragnet Nation" again, which speaks to this directly). However, from my research on my current book, social media in general, and facebook in particular, is the place where people find the most information. So the best thing you can do is crank up your privacy settings, be careful about what you share (don't assume those privacy settings are iron clad), and delete old stuff that you've posted liberally and frequently. None of this is surefire protection - content is archived, people make copies, privacy settings aren't perfect, etc - but these measures will make it a lot harder for people to track down potentially negative information to use against you.
veritasserum1 karma
Are we significantly less exposed if we do not use FB and Twitter? Is there any significant effort at the moment to mine patterns of use inside of TOR?
jengolbeck1 karma
Academic researchers focus on the easiest data to get, and that's FB and twitter. Your posts elsewhere are less studied and thus less exposed.
There's definitely no significant work on patterns inside TOR. Someone may have looked at that, but I don't remember seeing anything about it.
jengolbeck1 karma
I do Twitter, too. Lots of data could be used for this kind of stalking, but Facebook and Twitter have been the main focus of academic researchers.
jengolbeck2 karma
You're doing above and beyond all the typical recommendations I would give. There's not much to collect on you.
I know I keep repeating this, but Dragnet Nation that was published a month or two ago really talks about a lot of this and how to stay off the data grid.
I mentioned DoNotTrackMe. I'm not sure that will catch more than what you have, but it's a nice option. Also, find a plugin to block google analytics scripts if you don't have one yet. So many sites use that, and it can allow a lot of your browsing history to be reconstructed. (I have no insight into whether google is doing that, but I don't like the idea that it could be done)
breathe241 karma
What's a good way to start learning this for a physics researcher? (In-depth, not the layman's tour.)
jengolbeck1 karma
Laszlo Barabasi is a physicist who has done work in this network science space. Starting with his research would be a great place to jump in.
jengolbeck1 karma
I never got into the old Bond movies, so I like the new ones better. I loved Skyfall, but I think Casino Royale is probably my favorite.
sfiddles1 karma
Supposedly the Pentagon has a zombie apocalypse emergency plan - http://www.iflscience.com/health-and-medicine/pentagon-has-zombie-apocalypse-emergency-plan
As a self-described tracker of the zombie apocalypse, do you have a plan in place?
jengolbeck1 karma
Yes! I have an apocalypse bag (for zombies or other emergencies that could beset the DC area where fleeing would be good). It has a change of clothes, a bit of cash, first aid kit, food, medicine, and camping gear (axe, knife, flashlight, crank radio, etc.)
We keep water in the car so, if I had to leave, I'd just grab the bag and go. We had an earthquake here in DC in the middle of the night one time. It woke me up and I remember sleepily thinking to myself "Nuclear detonation or earthquake? Maybe I should get the apocalypse bag and head out." Fortunately, it was no big deal, but I was ready!


bobthebobd112 karma
If you guys know so much, why do you keep showing me advertisements of water heaters after I already bought one?