Report on the Investigative Data Warehouse
Table Of Contents
- Overview of the IDW
- IDW Systems Architecture
- Privacy Impact Assessment
- The Future of the IDW is Data Mining
In August 2006, the Electronic Frontier Foundation (EFF) sought government records concerning the Federal Bureau of Investigation (FBI)’s Investigative Data Warehouse (IDW) pursuant to the Freedom of Information Act (FOIA). After the FBI failed to respond to EFF’s requests within the timeline provided by the FOIA, EFF filed a lawsuit on October 17, 2006. Records began to arrive in September 2007. On April 14, 2009, the government filed a brief stating that no more documents were going to be provided, despite the Obama Administration’s new guidelines on FOIA.
The following report is based upon the records provided by the FBI, along with public information about the IDW and the datasets included in the data warehouse.
I. Overview of the Investigative Data Warehouse
The Investigative Data Warehouse is a massive data warehouse, which the Bureau describes as “the FBI’s single largest repository of operational and intelligence information.” As described by FBI Section Chief Michael Morehart in 2005, the “IDW is a centralized, web-enabled, closed system repository for intelligence and investigative data.” Unidentified FBI agents have described it as “one-stop shopping” for FBI agents and an “uber-Google.” According to the FBI, “[t]he IDW system provides data storage, database management, search, information presentation, and security services.”
Documents show that the FBI began spending funds on the IDW in fiscal year 2002, “and system implementation was completed in FY 2005.” “IDW 1.1 was released in July 2004 with enhanced functionality, including batch processing capabilities.” The FBI worked with Science Applications International Corporation (SAIC), Convera and Chiliad to develop the project, among other contractors. As of January 2005, the IDW contained “more than 47 sources of counterterrorism data, including information from FBI files, other government agency data, and open source news feeds.” A chart in the FBI documents shows IDW growing rapidly, breaking the half-billion mark in 2005. By March 2006, the IDW had 53 data sources and over half a billion (587,186,453) documents. By September 2008, the IDW had grown to nearly one billion (997,368,450) unique documents. The Library of Congress, by way of comparison, has about 138 million (138,313,427) items in its collection.
In addition to storing vast quantities of data, the IDW provides a content management and data mining system that is designed to permit a wide range of FBI personnel (investigative, analytical, administrative, and intelligence) to access and analyze aggregated data from over fifty previously separate datasets included in the warehouse. Moving forward, the FBI intends to increase its use of the IDW for “link analysis” (looking for links between suspects and other people – i.e., the Kevin Bacon game) and to start “pattern analysis” (defining a “predictive pattern of behavior” and searching for that pattern in the IDW’s datasets before any criminal offense is committed – i.e., pre-crime).
II. IDW Systems Architecture
According to an FBI project description, “The IDW system environment consists of a collection of UNIX and NT servers that provide secure access to a family of very large-scale storage devices. The servers provide application, web servers, relational database servers, and security filtering servers. User desktop units that have access to FBINet can access the IDW web application. This provides browser-based access to the central databases and their access control units. The environment is designed to allow the FBI analytic and investigative users to access any of the data sources and analytic capabilities of the system for which they are authorized. The entire configuration is designed to be scalable to enable expansion as more data sources and capabilities are added.”
A DOJ Inspector General report explained: “Data processing is conducted by a combination of Commercial-Off-the-Shelf (COTS) applications, interpreted scripts, and open-source software applications. Data storage is provided by several Oracle Relational Database Management Systems (DBMS) and in proprietary data formats. Physical storage is contained in Network Attached Storage (NAS) devices and component hard disks. Ethernet switches provide connectivity between components and to FBI LAN/WAN. An integrated firewall appliance in the switch provides network filtering.”
Pursuant to the IDW Concept of Operations, the IDW has two main subsystems, the IDW-Secret (IDW-S) and IDW-Special Projects Team (IDW-SPT). It also has a development platform (IDW-D) and a subsystem for maintenance and testing (IDW-I).
The IDW-S system is the main subsystem of the IDW, which is authorized to process classified national security data up to, and including, information designated Secret. However, IDW-S is not authorized to process any Top Secret data nor any Sensitive Compartmented Information (SCI). The addition of IDW-TS/SCI, a Top Secret/Sensitive Compartmented Information level data mart, appears to remain in the planning stages. The IDW-S system is the successor of the Secure Counter-Terrorism/Collaboration Operational Prototype Environment (SCOPE).
IDW-Special Projects Team
According to an Inspector General report, “[i]n November 2003, the Counterterrorism Division, along with the Terrorist Financing Operations Section (TFOS), in the FBI began a special project to augment the existing IDW system with new capabilities for use by FBI and non-FBI agents on the JTTFs. The FBI Office of Intelligence is the executive sponsor of the IDW. The IDW Special Projects Team was originally initiated for the 2004 Threat Task Force.” By May 2006, the “Special Project Team provided services to 5 task forces or operations.”
Special Projects Team (SPT) Subsystem
The Special Projects Team (SPT) Subsystem allows for the rapid import of new specialized data sources. These data sources are not made available to the general IDW users but instead are provided to a small group of users who have a demonstrated “need-to-know.” The SPT System is similar in function to the IDW-S system; the main difference is that it draws on a different set of data sources. The SPT System allows its users to access not only the standard IDW Data Store but also the specialized SPT Data Store.
In 2004, Willie Hulon, then the Deputy Assistant Director for the Counterterrorism Division, said that the FBI was “introducing advanced analytical tools to help us make the most of the data stored in the IDW. These tools allow FBI agents and analysts to look across multiple cases and multiple data sources to identify relationships and other pieces of information that were not readily available using older FBI systems. These tools 1) make database searches simple and effective; 2) give analysts new visualization, geo-mapping, link-chart capabilities and reporting capabilities; and 3) allow analysts to request automatic updates to their query results whenever new, relevant data is downloaded into the database.”
Deputy Assistant Director Hulon also asserted that “[w]hen the IDW is complete, Agents, JTTF [Joint Terrorism Task Force] members and analysts, using new analytical tools, will be able to search rapidly for pictures of known terrorists and match or compare the pictures with other individuals in minutes rather than days. They will be able to extract subjects’ addresses, phone numbers, and other data in seconds, rather than searching for it manually. They will have the ability to identify relationships across cases. They will be able to search up to 100 million pages of international terrorism-related documents in seconds.” (Since then, the number of records has grown nearly ten-fold.)
At the FBI National Security Branch’s “request, the FBI’s Office of the Chief Technology Officer (OCTO) has developed an ‘alert capability’ that allows users of IDW to create up to 10 queries of the system and be automatically notified when a new document is uploaded to the database that meets their search criteria.”
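The alert capability described above can be pictured with a minimal sketch. It is not drawn from the FBI records: the class name `AlertRegistry`, its methods, and the rule that a document matches when it contains every term of a saved query are all assumptions made for illustration. Only the limit of 10 queries per user comes from the quoted description.

```python
MAX_ALERTS_PER_USER = 10  # the only figure taken from the quoted description


class AlertRegistry:
    """Hypothetical store of saved queries, keyed by user."""

    def __init__(self):
        self._queries = {}  # user -> list of saved queries (each a list of terms)

    def add_alert(self, user, query):
        """Save a query for a user, refusing once the user holds 10."""
        saved = self._queries.setdefault(user, [])
        if len(saved) >= MAX_ALERTS_PER_USER:
            raise ValueError("user already has the maximum number of alerts")
        saved.append(query.lower().split())

    def users_to_notify(self, document_text):
        """Return users with at least one saved query matched by the new document.

        Assumption: a query matches when every one of its terms appears
        as a whole word in the document.
        """
        words = set(document_text.lower().split())
        return sorted(
            user
            for user, queries in self._queries.items()
            if any(all(term in words for term in query) for query in queries)
        )
```

In this sketch, each newly ingested document would be passed to `users_to_notify`, and the returned users would receive the notification.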
“Users can search for terms within a defined parameter of one another. For example, the search: ‘flight school’ NEAR/10 ‘lessons’ would return all documents where the phrase ‘flight school’ occurred within 10 words of the word ‘lessons.’ Users can also specify whether they want exact searches, or if they want the search tool to include other synonyms and spelling variants for words and names.”
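To make the NEAR/10 example concrete, the following is a rough sketch of a proximity test. The function names `tokenize` and `near` are hypothetical, and the way distance is counted (the number of word positions between the term and the nearest edge of the phrase) is an assumption; the records do not say how the IDW search tool measures it.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(text, phrase, term, distance):
    """Return True if `phrase` occurs within `distance` words of `term`."""
    words = tokenize(text)
    phrase_words = tokenize(phrase)
    term_word = term.lower()
    n = len(phrase_words)
    # Start index of every occurrence of the phrase in the document.
    phrase_starts = [
        i for i in range(len(words) - n + 1) if words[i:i + n] == phrase_words
    ]
    term_positions = [i for i, w in enumerate(words) if w == term_word]
    for start in phrase_starts:
        end = start + n - 1
        for pos in term_positions:
            # Gap is measured from the nearest edge of the phrase.
            gap = start - pos if pos < start else pos - end
            if 0 < gap <= distance:
                return True
    return False
```

Under these assumptions, `near(doc, "flight school", "lessons", 10)` plays the role of the query ‘flight school’ NEAR/10 ‘lessons’ for a single document.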
“IDW includes the ability to search across spelling variants for common words, synonyms and meaning variants for words, as well as common misspellings of words. If a user misspells a common word, IDW will run the search as specified, but will prompt the user to ask if they intended to run the search with the correct spelling.”
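A “did you mean” prompt of this kind can be sketched with Python’s standard `difflib` module. This is an illustration only: the function name `suggest_spelling`, the small vocabulary, and the 0.8 similarity cutoff are assumptions, and nothing in the records indicates how the IDW actually detects misspellings.

```python
import difflib


def suggest_spelling(term, vocabulary):
    """Return the closest known word if `term` is not in the vocabulary, else None."""
    if term in vocabulary:
        return None  # spelled as a known word; nothing to suggest
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else None


# The search still runs as typed; the suggestion is only offered to the user.
vocabulary = ["flight", "school", "lessons"]
suggestion = suggest_spelling("lesons", vocabulary)
if suggestion is not None:
    print(f"Did you mean '{suggestion}'?")
```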
In its 2004 report to the 9-11 Commission, the FBI used an example (shown on the right) to illustrate the planned use of the IDW for data mining and link analysis, showing i2’s Analyst’s Notebook. i2 described the program as “the world’s most powerful visual investigative analysis software,” which is able to analyze “vast amounts of raw, multi-format data gathered from a wide variety of sources.”
By 2006, the IDW was processing between 40,000 and 60,000 “interactive transactions” in any given week, along with between 50 and 150 batch jobs. An example of a batch process is where “the complete set of Suspicious Activity Reports is compared to the complete set of FBI terrorism files to identify individuals in common between them.”
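The batch comparison in that example amounts to finding the overlap between two record sets. A minimal sketch follows, assuming each record is a dictionary with a `name` field and that matching is done on a normalized name; both are assumptions, since the records do not describe the matching logic, and the sample data is invented.

```python
def normalize(name):
    """Collapse case and whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def individuals_in_common(sar_records, case_records):
    """Return the normalized names that appear in both record sets."""
    sar_names = {normalize(r["name"]) for r in sar_records}
    case_names = {normalize(r["name"]) for r in case_records}
    return sorted(sar_names & case_names)


# Hypothetical sample data, for illustration only.
sars = [{"name": "Person A"}, {"name": "Person  B"}]
cases = [{"name": "person b"}, {"name": "Person C"}]
print(individuals_in_common(sars, cases))
```

Matching on names alone, as this sketch does, would treat different people with the same name as a single individual.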
Datasets in the IDW
Automated Case System (ACS), Electronic Case File (ECF). This dataset contains ASCII flat files (metadata and document text) and WordPerfect documents consisting of the ECs, FD-302s, Facsimiles, FD-542s, Inserts, Transcriptions, Teletypes, Letter Head Memorandums (LHM), Memorandums and other FBI documents contained within ACS. The ACS system, which came on-line in October 1995, is the FBI’s centralized electronic case management system. It consists of the following components:
- Investigative Case Management — used to open a case and assign a unique 9-digit case number, called the Universal Case File Number, which consists of the FBI crime classification number; a two-letter alpha code designating the field office that opened the case; and a consecutive, numerical designator generated by the system.
- Electronic Case File — used to maintain investigative documentation, such as interview transcripts. Upon approval of a paper document, an electronic copy of the completed document is uploaded to the electronic case file.
- Universal Index — used to maintain index records for a case and allows the searching of records in a variety of ways.
[NOTE: While ACS is the current FBI case file system, it may soon be replaced. The FBI originally intended to replace ACS with the “Virtual Case File” system. After what the Office of the Inspector General called “FBI’s failed $170 million VCF project,” the FBI now “plans to replace the ACS system with the Sentinel Case Management System. The projected implementation date is 2009.” “When up and running, Sentinel will provide more current case information, audio, video, pictures and multimedia into the IDW system.”]
- Secure Automated Messaging Network (SAMNet) — ASCII files in standard cable traffic message format (all capitals with specific header), consisting of all messaging traffic sent either from the FBI to other government agencies, or sent from other government agencies to the FBI through the Automated Digital Information Network (AutoDIN), including Intelligence Information Reports (IIRs) and Technical Disseminations (TD) from the FBI, Central Intelligence Agency (CIA), Defense Intelligence Agency (DIA), and others from November of 2002 to present. IDW receives copies of these classified messages up to Secret with no SCI caveats.
- Joint Intelligence Committee Inquiry (JICI) Documents — Scanned copies (TIFF images and ASCII OCR text) of “all FBI documents related to extremist Islamic terrorism between 1993 and 2002.” These are counterterrorism files that were scanned into a database to accommodate the JICI’s investigation into the attacks of September 11th.
- Open Source News — Includes various foreign news sources that have been translated into English, as well as a few large U.S. publications. The open source data collected for the FBI comes from the MiTAP system run by San Diego State University. MiTAP is a system that collects raw data from the internet, standardizes the format, extracts named entities, and routes documents into appropriate newsgroups. This dataset is part of the Defense Advanced Research Projects Agency (DARPA) Translingual Information Detection, Extraction and Summarization (TIDES) Open Source Data project.
- Violent Gang and Terrorist Organization File (VGTOF) — Lists of individuals and organizations who the FBI believes to be associated with violent gangs and terrorism, provided by the FBI National Crime Information Center (NCIC). It includes biographical data and photos pertaining to members of the identified groups in the form of ASCII flat files (data/metadata) and JPEG image binaries (none, one or multiple per subject). The biographical data includes the “individual’s name, sex, race, and group affiliation, and, if possible, such optional information as height and weight; eye and hair colors; date and place of birth; and marks, scars, and tattoos.”
- CIA Intelligence Information Reports (IIR) and Technical Disseminations (TD) — A copy of all IIRs and TDs at the Secret security classification or below that were sent to the FBI from 1978 to at least May 2004. Intelligence Information Reports are designed to provide the FBI with the specific results of classified intelligence collected on internationally-based terrorist suspects and activities, chiefly abroad.
- Eleven (11) IntelPlus scanned document libraries — Copies of millions of scanned TIFF format documents and their corresponding OCR ASCII text related to FBI’s major terrorism-related cases. IntelPlus is an application that allows the users to view “Table of Contents” lists from large collections of records. The user is able to display the document whether it is in text form or one of several graphic formats and then print, copy or store the information. The application allows tracking associated documents on related topics and provides a search capability.
- Eleven (11) Financial Crimes Enforcement Network (FinCEN) Databases — Data related to terrorist financing. “FinCEN requires financial institutions to preserve financial paper trails behind transactions and to report suspicious transactions to FinCEN for its database. FinCEN matches its database with commercial databases such as Lexis/Nexis and the government’s law enforcement databases, allowing it to search for links among individuals, banks, and bank accounts.” At least one of these databases includes all currency transaction report (CTR) forms on bank customers’ cash transactions of more than $10,000: “In 2004, FinCEN first provided the FBI with bulk transfer of [CTRs].” Over 37 million CTRs were filed between 2004 and 2006.
- Two (2) Terrorist Financing Operations Section Databases — Biographical and financial reports on terrorism-related individuals. According to Dennis Lormel, Section Chief of the Terrorist Financing Operations Section, TFOS has a “centralized terrorist financial database which the TFOS developed in connection with its coordination of financial investigation of individuals and groups who are suspects of FBI terrorism investigations. The TFOS has cataloged and reviewed financial documents obtained as a result of numerous financial subpoenas pertaining to individuals and accounts. These documents have been verified as being of investigatory interest and have been entered into the terrorist financial database for linkage analysis. The TFOS has obtained financial information from FBI Field Divisions and Legal Attache Offices, and has reviewed and documented financial transactions. These records include foreign bank accounts and foreign wire transfers.”
- Foreign Financial List — Copies of information concerning terrorism-related persons, addresses, and other biographical data submitted to U.S. financial institutions from foreign financial institutions.
- Selectee List — Copies of a Transportation Security Administration (TSA) list of individuals that the TSA believes warrant additional security attention prior to boarding a commercial airliner. According to Michael Chertoff, “fewer than” 16,000 people were designated “selectees” as of October 2008.
- Terrorist Watch List (TWL) — The FBI Terrorist Watch and Warning Unit (TWWU) list of names, aliases, and biographical information regarding individuals submitted to the Terrorist Screening Center (TSC) for inclusion into VGTOF and TIPOFF watch lists. Also called the Terrorist Screening Database (TSDB), the database “contained a total of 724,442 records as of April 30, 2007.”
- No Fly List — A copy of a TSA list of individuals barred from boarding a commercial airplane. According to Michael Chertoff, 2,500 people were on the “no fly” list as of October 2008.
- Universal Name Index (UNI) Mains — A copy of index records for all main subjects on FBI investigations, except certain records that might reveal people in witness protection or informants. “A main file name is that of an individual who is, himself/herself, the subject of an FBI investigation.”
- Universal Name Index (UNI) Refs — A copy of index records for all individuals referenced in FBI investigations, except certain records that might reveal people in witness protection or informants. A “reference is someone whose name appears in an FBI investigation. References may be associates, conspirators, or witnesses.”
- Department of State Lost and Stolen Passports — A copy of records pertaining to lost and stolen passports. “The Consular Lost and Stolen Passports (CLASP) database includes over 1.3 million records concerning U.S. passports. All passport applications are checked against CLASP, PIERS [Passport Information Electronic Records System], the Social Security Administration’s database, and the Consular Lookout and Support System (CLASS), which includes information provided by the Department of Health and Human Services (HHS) and law enforcement agencies such as the Federal Bureau of Investigations (FBI) and U.S. Marshals Service.” “The overall CLASS database of names has risen to over 20 million records in recent years, including millions of names of criminals from FBI records provided to the State Department under the terms of the USA PATRIOT Act.” “The Online Passport Lost & Stolen System permits citizens to report a lost or stolen passport.” It includes “Name, date of birth (DOB), social security number (SSN), address, telephone number, and e-mail address,” as reported by the citizen.
- Department of State Diplomatic Security Service — A copy of past and current passport fraud investigations from the “DOS DDS RAMS database.” The Records Analysis Management System (RAMS) Database “allows all Field Offices, Resident Agent Offices (RAO) and the Bureau of Diplomatic Security to track, maintain, and efficiently share law enforcement investigative case information. RAMS contains CLASSIFIED information.” By September 2005, the Department of State was “developing a ‘Knowledge Base’ on-line library that will be a ‘gateway’ to passport information, anti-fraud information, and relevant databases. All passport field agencies and centers can use this system to submit anti-fraud information such as exemplars of genuine and malafide documents, fraud trends in their respective regions, and other information that will be instantly available throughout the department.”
In August 2004, the FBI was considering adding several more datasets: the “FBI’s Telephone Application, DHS data sources such as US-VISIT and SEVIS, Department of State data sources such as the Consular Consolidated Database (CCD), and Treasury Enforcement Communication System (TECS).” A later document shows that at least “most” of the Telephone Application is now in the IDW.
The Telephone Application (TA) “provides a central repository for telephone data obtained from investigations.” “The TA is an investigative tool that also serves as the central repository for all telephone data collected during the course of FBI investigations. Included are pen register data, toll records, trap/trace, tape-edits, dialed digits, airnet (pager intercepts), cellular activity, push-to-talk, and corresponding subscriber information.” Records obtained through National Security Letters are placed in the Telephone Application, as well as the IDW by way of the ACS system.
“The United States Visitor and Immigrant Status Indicator Technology (US-VISIT) Program is an integrated, automated biometric entry-exit system that records the arrival and departure of aliens; conducts certain terrorist, criminal, and immigration violation checks on aliens; and compares biometric identifiers to those collected on previous encounters to verify identity.”
The Consular Consolidated Database (CCD) is a set of databases that includes “current and archived data from all of the Department of State’s Consular Affairs post databases around the world. This includes the data from the Automated Biometric Identification System (ABIS), ARCS, Automated Cash Register System (ACS), Consular Lookout and Support System (CLASS), Consular Shared Tables (CST), DataShare, Diversity Visa Information System (DVIS), Immigrant Visa Information System (IVIS), Immigrant Visa Overseas (IVO), Non-Immigrant Visa (NIV), Visa Opinion Information Service (VOIS), and Waiver Review System (WRS) applications. The CCD also provides access to passport data in the Travel Document Information System (TDIS), Passport Lookout and Tracking System (PLOTS), and Passport Information Electronic Records System (PIERS). In addition to Consular Affairs data, other data from external agencies is integrated into the CCD, such as the ‘Master Death Database’ from the Social Security Administration.”
The Student and Exchange Visitor Information System (SEVIS) “maintains information on nonimmigrant students and exchange visitors (F, M and J Visas) and their dependents, and also on their associated schools and sponsors.”
The Treasury Enforcement Communication System (TECS) “is a computerized information system designed to identify individuals and businesses suspected of, or involved in violation of federal law. The TECS is also a communications system permitting message transmittal between Treasury law enforcement offices and other Federal, national, state, and local law enforcement agencies.”
Unidentified Additional Data Sources Added to IDW
The FBI set up an Information Sharing Policy Group (ISPG), chaired by the Executive Assistant Directors of Administration and Intelligence, to review requests to ingest additional datasets into the IDW, in response to Congressional “privacy concerns that may arise from FBI engaging in ‘data mining.’” In February 2005, the Counterterrorism Division asked for 8 more data sources. While the names of the data sources are redacted, items 1, 2 and 4 came from the Department of Homeland Security, and items 6, 7 and 8 were additional IntelPlus file rooms. The February 2005 email chain also refers to “2 data sets approved at the meeting yesterday” and “2 data set under consideration.” In context, it appears that one of the two approved datasets was IntelPlus, which contained three file rooms. The FBI would “get all of the DHS data from the FTTTF [Foreign Terrorist Tracking Task Force] including the [Redacted].” In March 2005, the Information Sharing Policy Group approved seven more unidentified datasets for the Special Projects Team version of the IDW. In May 2005, ISPG approved an additional seven unidentified datasets for the IDW-SPT. The IDW Special Projects Team “ingested and published a new telephone-type data source” on two dates: February 18, 2005, and March 18, 2005. In August 2005, the “[Redacted] Reports Collection” was moved from the limited access IDW-SPT to the more widely available IDW-S. “This [Redacted] dataset contains copies of reports regarding [Redacted].”
There is no current Disposition Schedule for IDW. We have looked at the system and it is on our list of systems to be scheduled. With no Disposition Schedule, there is really no limitation on importing data, at least not from a records management standpoint. But, they will not be able to delete or destroy any of that information until a Disposition Schedule is approved.
Nevertheless, the IDW has a process to delete files: “it can occur that data for which IDW-S is not authorized is ingested into IDW-S. When such data is discovered on IDW-S it is necessary to delete this data and to update the Document Tracking Database with the appropriate ‘DEL’ status for the file.” The IDW also has a “secure delete” function.
III. Privacy Impact Assessment
The E-Government Act of 2002, Section 208, establishes a requirement for agencies to conduct privacy impact assessments (PIAs) for electronic information systems and collections.
A May 12, 2005 email from an unidentified employee in the FBI’s Office of the General Counsel to FBI General Counsel Valerie Caproni notes that the author was “nervous about mentioning PIA in context of national security systems.” The author admitted that “It is true the FBI currently requires PIAs for NS [national security] systems as well as non-NS systems.” However, the author thought that the policy might change. Accordingly, the author “recommend[ed] against raising congressional consciousness levels and expectations re NS PIAs.” Caproni’s response is short: “ok.”
This email was in reply to a May 11 email from Caproni expressing her desire to “slide something in about PIA” to give a “sense that we really do worry about the privacy interests of uninvolved people whose data we slurp up.”
However, this strategy failed. Congressional consciousness levels were raised by an August 30, 2006 Washington Post article on the IDW, in which EFF Senior Counsel David Sobel raised the issue of the IDW’s lack of a formally published PIA.
The day the Post article ran, several FBI emails discussed the privacy concerns raised by the IDW. One Office of the General Counsel employee (only identified as Bill) explained the FBI’s desire to play down the concerns: “I’m with [Redacted] in view that if everyone ([Redacted]) starts running around with their hair on fire on this, they will just be pouring gas on something that quite possibly would just fade away if we just shrug it off.”
After these discussions, the FBI released the following response to the article:
Federal Bureau of Investigation
Response to Investigative Data Warehouse (IDW) Press Article for Senate Appropriations Committee
September 7, 2006
There are two concerns being expressed about IDW in the article. One deals with whether the FBI has complied with the Privacy Act’s requirement to publish a “systems notice” in the Federal Register and the other is whether the FBI has complied with the privacy impact analysis requirements of the “E-Government Act.”
The answer to the first question is “yes.” We consider IDW to be part of the FBI’s Central Record System, an “umbrella” system that is comprised of all of the FBI’s investigative files. While it is true that “IDW” isn’t specifically mentioned in the CRS Privacy Act System Notice, we don’t believe that is necessary. The system notice does state: “In recent years … the FBI has been confronted with increasingly complicated cases, which require more intricate information processing capabilities. Since these complicated investigations frequently involve massive volumes of evidence and other investigative information, the FBI uses its computers, when necessary to collate, analyze, and retrieve investigative information in the most accurate and expeditious manner possible.” The system notice describes in reasonable detail what information we obtain, what routine uses we make of it, the authorities for maintaining the system and so forth. This notice is published in the Federal Register and is publicly available. In our view, we are compliant with both the letter and spirit of the Privacy Act in this regard.
The answer to the second question is also “yes.” In fact, since IDW has been categorized as a “national security system,” the E-Government Act does not require it to undergo a privacy impact analysis (PIA) at all. Even so, FBI and DOJ policy requires a PIA to be conducted. For IDW, the FBI has done several PIA’s. We did one for the original system and did others as significant datasets were added to IDW. None of these systems were published since the law does not require them to be conducted in the first place. The point is that we have done far more to analyze the privacy implications of IDW than the law requires. Yes, the analyses have not been conducted in the public domain but Congress weighed the costs and benefits of conducting such an analysis in public and chose to exclude national security systems from that requirement when it passed the E-Government act.
For purposes of the E-Government Act, a National Security System is “an information system operated by the federal government, the function, operation or use of which involves: (a) intelligence activities, (b) cryptologic activities related to national security, (c) command and control of military forces, (d) equipment that is an integral part of a weapon or weapons systems, or (e) systems critical to the direct fulfillment of military or intelligence missions.”
A heavily redacted March 2005 FBI Electronic Communication enclosed a completely redacted Privacy Impact Assessment about the IDW. In August 2007, the Office of the Inspector General conducted an audit of “all major Department [of Justice] information technology (IT) systems and planned initiatives.” The OIG noted that it “did not obtain PIAs or explanations for the FBI’s IDW.”
IV. The Future of the IDW is Data Mining
When the FBI explained the IDW to Congress in 2004, it noted that when FBI Director Mueller testified about the IDW in 2003, he “used the term ‘data mining’ to be synonymous with ‘advanced analysis.’ The FBI does not conduct ‘data mining’ in accordance with the GAO definition, which means mining through large volumes of data with the intention of automatically predicting future activities.”
Nevertheless, in March 2003, the FBI issued its Fiscal Year 2004 (Oct. 2003 – Sep. 2004) budget, in which the Bureau had requested a new “Communications Application”:
The FBI requests $4,600,000 to obtain a software application that is capable of conducting sophisticated link analysis on extremely high volumes of telephone toll call data and other relational data. This software would enable the FBI to leverage modern technology to expeditiously conduct analyses of large collections of relational data.
By 2005, the FBI was still trying to minimize Congressional concerns over data mining. The FBI was concerned that the “distinction between a data mart and a data mining vehicle will be lost on those who just think we are looking into citizens’ lives too much.” On March 1, 2005, an unidentified Office of Congressional Affairs (OCA) employee noted in an email (emphasis original):
We had agreed on the following sentence as a way of avoiding some of the intricacies of data mining policy: “Where permitted by law, and appropriate to an authorized work activity, information gleaned from searching non-FBI databases may be included in FBI systems and, once there, may be accessed by employees conducting searches in furtherance of other authorized activities.”
Unfortunately, I couldn’t get that to fly, since that was the crux of the Senator’s inquiry.
FBI emails from October 2005 discuss the Bureau’s response to the August 2005 GAO report on data mining by the Foreign Terrorist Tracking Task Force (FTTTF). “In 2001, Homeland Security Presidential Directive-2 established the Foreign Terrorist Tracking Task Force (FTTTF) to provide actionable intelligence to law enforcement to assist in the location and detention and ultimate removal of terrorists and their supporters from the US.” The FTTTF “operates two information systems—one unclassified and one classified—that form the basis of its data mining activities,” using tools such as the i2 Analyst Notebook application, the Query Tracking and Initiation Program (QTIP), and Wareman. In addition to the FBI, “the participants in the FTTTF include the Department of Defense, the Department of Homeland Security’s Bureaus of Immigration and Customs Enforcement and the Customs and Border Protection, the State Department, the Social Security Administration, the Office of Personnel Management, the Department of Energy, and the Central Intelligence Agency.”
In these 2005 emails, an OCA employee suggested a limitation on the scope of the FBI’s response to Congress: “Maybe we say that ‘FTTTF refers to an operational task force. We understand the question to ask about data mining initiatives of FTTTF.’”
Around the same time, an unidentified Office of the General Counsel employee wrote:
Finally – I’m concerned about the statement that we only have 3 data mining projects in the FBI. In the cover letter, you make the point that our definition of data mining only includes large sets of data but I still think the definition is very broad and could include other systems. For example, what about STAS systems? I am not familiar with those systems -(but we are starting work on a PIA so I will be in the near future) but my sense is that they collect and sift through a lot of data. What about EDMS and some of the other systems that collect tech cut data from FISAs and allow analysts to search through the data for relevant info? I would think that could be considered data mining under your definition – but I’ll defer to the CIO’s office on this issue. We just need to make sure we can distinguish these other projects.
A few years later, however, the FBI became less circumspect about marrying the data sets of the IDW with the data mining capabilities of the FTTTF. In its FY2007 War Supplemental budget request, the FBI sought $10 million to consolidate the IDW and the FTTTF “and to develop and deploy a robust infrastructure capable of receiving, processing, and managing the quality of substantially increased amounts of additional data.”
In its FY2008 “budget justification,” the FBI explained that “[t]he Investigative Data Warehouse (IDW), combined with FTTTF’s existing applications and business processes, will form the backbone of the NSB’s data exploitation system.” The FBI also requested “$11,969,000 … for the National Security Branch Analysis Center (NSAC).” The justification explains:
Once operational, the NSAC will be tasked to satisfy unmet analytical and technical needs of the NSB, particularly in the areas of bulk data analysis, pattern analysis, and trend analysis. … The NSAC will provide subject-based “link analysis” through the utilization of the FBI’s collection datasets, combined with public records on predicated subjects. “Link analysis” uses datasets to find links between subjects, suspects, and addresses or other pieces of relevant information, and other persons, places, and things. This technique is currently being used on a limited basis by the FBI; the NSAC will provide improved processes and greater access to this technique to all NSB components. The NSAC will also pursue “pattern analysis” as part of its service to the NSB. “Pattern analysis” queries take a predictive model or pattern of behavior and search for that pattern in datasets. The FBI’s efforts to define predictive models and patterns of behavior will improve efforts to identify “sleeper cells.”
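To make the budget justification's description of “link analysis” concrete, the following is a minimal sketch of the general technique, not a description of any FBI system. The records, subject names, and the functions `build_links` and `find_path` are all hypothetical, invented here for illustration. The sketch treats two subjects as linked when they share a phone number or address, then searches for the shortest chain of links between two subjects.

```python
from collections import defaultdict, deque

# Hypothetical records (invented for illustration):
# (subject, attribute kind, attribute value).
RECORDS = [
    ("Subject A", "phone", "555-0101"),
    ("Subject B", "phone", "555-0101"),
    ("Subject B", "address", "12 Elm St"),
    ("Subject C", "address", "12 Elm St"),
    ("Subject D", "phone", "555-0199"),
]


def build_links(records):
    """Link any two subjects that share an attribute value."""
    by_attribute = defaultdict(set)
    for subject, kind, value in records:
        by_attribute[(kind, value)].add(subject)

    links = defaultdict(set)
    for subjects in by_attribute.values():
        for subject in subjects:
            links[subject] |= subjects - {subject}
    return links


def find_path(links, start, goal):
    """Breadth-first search for the shortest chain of links, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in sorted(links.get(path[-1], ())):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


links = build_links(RECORDS)
print(find_path(links, "Subject A", "Subject C"))
print(find_path(links, "Subject A", "Subject D"))
```

In this toy data, Subject A and Subject C never appear in the same record, yet they are connected through Subject B, who shares a phone number with one and an address with the other. Subject D shares nothing with anyone and so has no path. The example also shows why the size of the underlying datasets matters: every added record can create new links between people who have no direct relationship.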
“The National Security Analysis Center (NSAC) would bring together nearly 1.5 billion records created or collected by the FBI and other government agencies, a figure the FBI expects to quadruple in coming years.” In June 2007, after seeing this budget request and noting that “[d]ocuments predict the NSAC will include six billion records by FY2012,” the House Science and Technology Committee asked the Government Accountability Office to investigate the National Security Branch Analysis Center.
In 2008, the non-partisan National Research Council issued a 352-page study concluding that data mining is not an effective tool in the fight against terrorism. The report noted the poor quality of the data, the inevitability of false positives, the preliminary nature of the scientific evidence and individual privacy concerns in concluding that “automated identification of terrorists through data mining or any other mechanism is neither feasible as an objective nor desirable as a goal of technology development efforts.”
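The report's point about false positives can be illustrated with simple base-rate arithmetic. The sketch below is an illustration only: the population size, the number of actual targets, and both accuracy rates are hypothetical assumptions chosen for the example, not figures from the National Research Council study or from any FBI document, and `flagged_counts` is a name invented here.

```python
def flagged_counts(population, true_targets, detection_rate, false_positive_rate):
    """Return (true hits, false hits, share of flagged people who are real targets)."""
    true_hits = true_targets * detection_rate
    false_hits = (population - true_targets) * false_positive_rate
    precision = true_hits / (true_hits + false_hits)
    return true_hits, false_hits, precision


# All four inputs are hypothetical assumptions for illustration.
true_hits, false_hits, precision = flagged_counts(
    population=300_000_000,     # people whose records are searched
    true_targets=1_000,         # actual targets among them
    detection_rate=0.99,        # share of real targets the pattern flags
    false_positive_rate=0.001,  # share of innocent people wrongly flagged
)

print(f"Real targets flagged:    {true_hits:,.0f}")
print(f"Innocent people flagged: {false_hits:,.0f}")
print(f"Share of flags that are real targets: {precision:.2%}")
```

Under these assumed numbers, a pattern that wrongly flags only one innocent person in a thousand still flags roughly 300,000 innocent people against fewer than 1,000 real targets, so well under one percent of the people flagged are actual targets. Because real targets are so rare relative to the population searched, even a very accurate pattern produces far more false leads than true ones.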
Automated Biometric Identification System
Automated Case System
American Standard Code for Information Interchange
Consular Consolidated Database
Consular Lost and Stolen Passports
Consular Lookout and Support System
Central Intelligence Agency
Central Record System
Consular Shared Tables
Defense Advanced Research Projects Agency
Department of Homeland Security
Department of Justice
Department of State
Oracle Relational Database Management Systems
Diversity Visa Information System
Electronic Case File
Electronic Surveillance Data Management System
Federal Bureau of Investigation
Financial Crimes Enforcement Network
Foreign Intelligence Surveillance Act
Freedom of Information Act
Foreign Terrorist Tracking Task Force
Government Accountability Office
Investigative Data Warehouse
Intelligence Information Reports
Information Sharing Policy Group
Joint Intelligence Committee Inquiry
Joint Terrorism Task Force
Network Attached Storage
National Crime Information Center
National Security Analysis Center
National Security Branch of the FBI
National Security Letter
Office of Congressional Affairs of the FBI
Optical Character Recognition
Office of the General Counsel
Office of the Inspector General
Online Passport Lost & Stolen System
Privacy Impact Assessment
Passport Information Electronic Records System
Query Tracking and Initiation Program
Records Analysis Management System
Secure Counter-Terrorism/Collaboration Operational Prototype Environment
Sensitive Compartmented Information
Student and Exchange Visitor Information System
Special Projects Team
Special Technologies and Applications Section
Treasury Enforcement Communication System
Terrorist Financing Operations Section
Translingual Information Detection, Extraction and Summarization
Tagged Image File Format
Transportation Security Administration
Terrorist Screening Center
Terrorist Screening Database
Terrorist Watch List
Terrorist Watch and Warning Unit
Universal Name Index
United States Visitor and Immigrant Status Indicator Technology
Uniting and Strengthening America by Providing Appropriate Tools Required to Intercept and Obstruct Terrorism
Virtual Case File
Violent Gang and Terrorist Organization File
Visa Opinion Information Service
Waiver Review System