
At 1 a.m. on May 7, the website PatientsLikeMe.com noticed suspicious activity on its 'Mood' discussion board. There, people exchange highly personal stories about their emotional disorders, ranging from bipolar disease to a desire to cut themselves.


It was a break-in. A new member of the site, using sophisticated software, was 'scraping,' or copying, every single message off PatientsLikeMe's private online forums.


PatientsLikeMe managed to block and identify the intruder: Nielsen Co., the privately held New York media-research firm. Nielsen monitors online 'buzz' for clients, including major drug makers, which buy data gleaned from the Web to get insight from consumers about their products, Nielsen says.

'I felt totally violated,' says Bilal Ahmed, a 33-year-old resident of Sydney, Australia, who used PatientsLikeMe to connect with other people suffering from depression. He used a pseudonym on the message boards, but his PatientsLikeMe profile linked to his blog, which contains his real name.

After PatientsLikeMe told users about the break-in, Mr. Ahmed deleted all his posts, plus a list of drugs he uses. 'It was very disturbing to know that your information is being sold,' he says. Nielsen says it no longer scrapes data from private message boards.


The market for personal data about Internet users is booming, and in the vanguard is the practice of 'scraping.' Firms offer to harvest online conversations and collect personal details from social-networking sites, resume sites and online forums where people might discuss their lives.


The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009.

The Wall Street Journal's examination of scraping -- a trade that involves personal information as well as many other types of data -- is part of the newspaper's investigation into the business of tracking people's activities online and selling details about their behavior and personal interests.

Some companies collect personal information for detailed background reports on individuals, such as email addresses, cell-phone numbers, photographs and posts on social-network sites.


Others offer what are known as listening services, which monitor in real time hundreds or thousands of news sources, blogs and websites to see what people are saying about specific products or topics.


One such service is offered by Dow Jones & Co., publisher of the Journal. Dow Jones collects data from the Web, which may include personal information, that help determine how corporate clients are covered in news articles, blogs and discussion boards. It says it doesn't gather information from password-protected parts of websites.

The competition for data is fierce. PatientsLikeMe also sells data about its users. PatientsLikeMe says the data it sells is anonymized, no names attached.


Nielsen spokesman Matt Anchin says the company's reports to its clients include publicly available information gleaned from the Internet, 'so if someone decides to share personally identifiable information, it could be included.'

Internet users often have little recourse if personally identifiable data is scraped: There is no national law requiring data companies to let people remove or change information about themselves, though some firms let users remove their profiles under certain circumstances.


California has a special protection for public officials, including politicians, sheriffs and district attorneys. It makes it easier for them to remove their home address and phone numbers from these databases, by filling out a special form stating they fear for their safety.


Data brokers long have scoured public records, such as real-estate transactions and courthouse documents, for information on individuals. Now, some are adding online information to people's profiles.


Many scrapers and data brokers argue that if information is available online, it is fair game, no matter how personal.


'Social networks are becoming the new public records,' says Jim Adler, chief privacy officer of Intelius Inc., a leading paid people-search website. It offers services that include criminal background checks and 'Date Check,' which promises details about a prospective date for $14.95.

'This data is out there,' Mr. Adler says. 'If we don't bring it to the consumer's attention, someone else will.'


New York-based PeekYou LLC has applied for a patent for a method that, among other things, matches people's real names to the pseudonyms they use on blogs, Twitter and other social networks. PeekYou's people-search website offers records of about 250 million people, primarily in the U.S. and Canada.


PeekYou says it also is starting to work with listening services to help them learn more about the people whose conversations they are monitoring. It says it hands over only demographic information, not names.


Employers, too, are trying to figure out how to use such data to screen job candidates. It's tricky: Employers legally can't discriminate based on gender, race and other factors they may glean from social-media profiles.


One company that screens job applicants for employers, InfoCheckUSA LLC in Florida, began offering limited social-networking data -- some of it scraped -- to employers about a year ago. 'It's slowly starting to grow,' says Chris Dugger, national account manager. He says he's particularly interested in things like whether people are 'talking about how they just ripped off their last employer.'

Scrapers operate in a legal gray area. Internationally, anti-scraping laws vary. In the U.S., court rulings have been contradictory. 'Scraping is ubiquitous, but questionable,' says Eric Goldman, a law professor at Santa Clara University. 'Everyone does it, but it's not totally clear that anyone is allowed to do it without permission.'

Scrapers and listening companies say what they're doing is no different from what any person does when gathering information online -- they just do it on a much larger scale.


'We take an incomprehensible amount of information and make it intelligent,' says Chase McMichael, chief executive of Infinigraph, a Palo Alto, Calif., 'listening service' that helps companies understand the likes and dislikes of online customers.

Scraping services range from dirt cheap to custom-built. Some outfits, such as 80Legs.com in Texas, will scrape a million Web pages for $101. One Utah company, screen-scraper.com, offers do-it-yourself scraping software for free. The top listening services can charge hundreds of thousands of dollars to monitor and analyze Web discussions.


Some scrapers-for-hire don't ask clients many questions.


'If we don't think they're going to use it for illegal purposes -- they often don't tell us what they're going to use it for -- generally, we'll err on the side of doing it,' says Todd Wilson, owner of screen-scraper.com, a 10-person company in Provo, Utah, that operates out of a two-room office. It is one of at least three firms in a scenic area known locally as 'Happy Valley' that specialize in scraping.

Screen-scraper charges between $1,500 and $10,000 for most jobs. The company says it's often hired to conduct 'business intelligence,' working for companies who want to scrape competitors' websites.


One recent assignment: A major insurance company wanted to scrape the names of agents working for competitors. Why? 'We don't know,' says Scott Wilson, the owner's brother and vice president of sales. Another job: attempting to scrape Facebook for a multi-level marketing company that wanted email addresses of users who 'like' the firm's page -- as well as their friends -- so they all could be pitched products.

Scraping often is a cat-and-mouse game between websites, which try to protect their data, and the scrapers, who try to outfox their defenses. Scraping itself isn't difficult: Nearly any talented computer programmer can do it. But penetrating a site's defenses can be tough.

One defense familiar to most Internet users involves 'captchas,' the squiggly letters that many websites require people to type to prove they're human and not a scraping robot. Scrapers sometimes fight back with software that deciphers captchas.

Some professional scrapers stage blitzkrieg raids, mounting around a dozen simultaneous attacks on a website to grab as much data as quickly as possible without being detected or crashing the site they're targeting.


Raids like these are on the rise. 'Customers for whom we were regularly blocking about 1,000 to 2,000 scrapes a month are now seeing three times or in some cases 10 times as much scraping,' says Marino Zini, managing director of Sentor Anti Scraping System. The company's Stockholm team blocks scrapers on behalf of website clients.

Julia Angwin / Steve Stecklow